Multiple Sequence Alignment

Author

[Editor] Benben Miao

Multiple sequence alignment (MSA) is a fundamental technique in bioinformatics. It aligns three or more biological sequences (DNA, RNA, or protein) according to their evolutionary or structural similarity, so that homologous sites (i.e., sites derived from a common ancestor) are placed in the same column wherever possible.
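To make the idea of aligned sites concrete, here is a minimal base R sketch. The sequences and their names are made up for illustration and are not taken from any dataset. Gap characters ("-") pad the sequences so that each column stands for one putatively homologous site.

```r
# Toy alignment (hypothetical sequences); "-" marks a gap
aln <- c(
  seq1 = "MKT-AYIA",
  seq2 = "MKTQAYLA",
  seq3 = "MRT-AYIA"
)

# Every row of an alignment must have the same length
stopifnot(length(unique(nchar(aln))) == 1)

# One row per sequence, one column per aligned site
aln_mat <- do.call(rbind, strsplit(aln, ""))

# Sites at which all sequences carry the same character
conserved <- apply(aln_mat, 2, function(site) length(unique(site)) == 1)
which(conserved)
```

Viewing the alignment as a character matrix is a convenient mental model: rows are sequences, columns are sites, and any per-site summary is a column-wise operation.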

Example

[Figure: example multiple sequence alignment plot]

Setup

  • System Requirements: Cross-platform (Linux/macOS/Windows)

  • Programming language: R

  • Dependent packages: ggmsa

# Install packages
if (!requireNamespace("ggmsa", quietly = TRUE)) {
  install.packages("ggmsa")
}

# Load packages
library(ggmsa)

Data Preparation

DNA/RNA/amino acid sequence data is usually stored in FASTA format, where each sequence consists of a description line (starting with ">") followed by sequence lines.

# Example data
protein_fasta <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Data preview
seqs <- readLines(protein_fasta)
head(seqs)
[1] ">PH4H_Rattus_norvegicus"                                                         
[2] "MAAVVLENGVLSRKLSDFGQETSYIEDNSNQNGAISLIFSLKEEVGALAKVLRLFEENDINLTHIESRPSRLNKDEYEFF"
[3] "TYLDKRTKPVLGSIIKSLRNDIGATVHELSRDKEKNTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQ"
[4] "FADIAYNYRHGQPIPRVEYTEEEKQTWGTVFRTLKALYKTHACYEHNHIFPLLEKYCGFREDNIPQLEDVSQFLQTCTGF"
[5] "RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFAQFSQEIG-LASLGAPDEYIE"
[6] "KLATIYWFTVEFGLCKEG-DSIKAYGAGLLSSFGELQYCLSD-KPKLLPLELEKTACQEYSVTEFQPLYYVAESFSDAKE"
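The FASTA layout described above can also be parsed by hand. The following is a minimal base R sketch, not the reader that ggmsa uses internally. The records in `fasta_lines` are hypothetical, and the sketch assumes every header is followed by at least one sequence line.

```r
# Toy FASTA content (hypothetical records), as readLines() would return it
fasta_lines <- c(
  ">seqA description",
  "MKT-AY",
  "IA",
  ">seqB",
  "MKTQAYLA"
)

# Header lines start with ">"; every other line belongs to the preceding header
is_header <- startsWith(fasta_lines, ">")
record_id <- cumsum(is_header)

# Concatenate the sequence lines of each record into one string
seq_strings <- vapply(
  split(fasta_lines[!is_header], record_id[!is_header]),
  paste, character(1), collapse = ""
)

# Name each sequence by its header, without the leading ">"
parsed <- setNames(unname(seq_strings), sub("^>", "", fasta_lines[is_header]))
parsed

# In an alignment file, all sequences should have the same length
length(unique(nchar(parsed))) == 1
```

For real data, a dedicated reader from an established package is usually a safer choice than hand-rolled parsing, since it handles edge cases such as empty records and blank lines.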

Visualization

1. Multiple sequence alignment of proteins

This plot shows how several protein sequences align over a selected region (here, alignment positions 300 to 330). It helps identify conserved regions and variable sites, and it colors residues with a dedicated amino acid color scheme (Chemistry_AA here).

# Multiple sequence alignment of proteins
p <- ggmsa(
  protein_fasta,
  start = 300,
  end = 330,
  font = "DroidSansMono",
  color = "Chemistry_AA",
  char_width = 0.5,
  seq_name = TRUE,
  consensus_views = FALSE
)

p
Figure 1: Multiple sequence alignment of proteins
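The `start` and `end` arguments restrict the plot to a window of alignment positions. The base R sketch below illustrates that kind of windowing on a toy alignment. It assumes 1-based, inclusive positions, and the names `win_start` and `win_end` are hypothetical; it does not reproduce how ggmsa subsets the data internally.

```r
# Toy aligned sequences (hypothetical)
aln <- c(
  seq1 = "MKT-AYIAKQ",
  seq2 = "MKTQAYLAKQ",
  seq3 = "MRT-AYIARQ"
)

# Hypothetical window, analogous in spirit to start/end
win_start <- 3
win_end <- 7

# Guard against windows that fall outside the alignment
aln_len <- unique(nchar(aln))
stopifnot(
  length(aln_len) == 1,
  win_start >= 1,
  win_end <= aln_len,
  win_start <= win_end
)

# Extract the same column range from every sequence
aln_window <- substr(aln, win_start, win_end)
aln_window
```

Checking the window against the alignment length before plotting is a cheap way to catch a region that does not exist in the data.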

2. Multiple sequence alignment and statistics

Adding a bar chart to the alignment summarizes the conservation and variation at each position within the plotted region, which helps identify key sites and regions.

# Multiple sequence alignment and statistics
p <- ggmsa(
  protein_fasta,
  start = 300,
  end = 320,
  font = "DroidSansMono",
  color = "Chemistry_AA",
  char_width = 0.5,
  seq_name = TRUE,
  consensus_views = FALSE
) +
  geom_msaBar()

p
Figure 2: Multiple sequence alignment and statistics
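A per-position conservation score can be approximated by hand. The sketch below uses one simple statistic, the fraction of sequences that share the most common character at each site. This is an illustrative choice and not necessarily what `geom_msaBar()` computes. The toy sequences are hypothetical, and gaps are counted as an ordinary character for simplicity.

```r
# Toy alignment (hypothetical sequences)
aln <- c(
  seq1 = "MKT-AYIA",
  seq2 = "MKTQAYLA",
  seq3 = "MRT-AYIA",
  seq4 = "MKT-AYLA"
)
aln_mat <- do.call(rbind, strsplit(aln, ""))

# Fraction of sequences sharing the most common character at each site
# (assumption: a gap "-" is treated like any other character)
conservation <- apply(
  aln_mat, 2,
  function(site) max(table(site)) / length(site)
)
round(conservation, 2)
```

A value of 1 marks a fully conserved site, while lower values indicate positions where the sequences disagree.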