Multiple Sequence Alignment
Multiple Sequence Alignment (MSA) is a fundamental technique in bioinformatics. It aligns three or more biological sequences (DNA, RNA, or protein) based on their evolutionary or structural similarity, so that homologous sites (i.e., sites derived from a common ancestor) fall in the same column wherever possible.
Example

Setup
System Requirements: Cross-platform (Linux/macOS/Windows)
Programming language: R
Dependent packages:
ggmsa
# Install packages
if (!requireNamespace("ggmsa", quietly = TRUE)) {
  install.packages("ggmsa")
}
# Load packages
library(ggmsa)
Data Preparation
DNA, RNA, and amino acid sequence data are usually stored in FASTA format, where each record consists of a description line (starting with ">") followed by one or more sequence lines.
# Example data
protein_fasta <- system.file("extdata", "sample.fasta", package = "ggmsa")
# Data preview
seqs <- readLines(protein_fasta)
head(seqs)
[1] ">PH4H_Rattus_norvegicus"
[2] "MAAVVLENGVLSRKLSDFGQETSYIEDNSNQNGAISLIFSLKEEVGALAKVLRLFEENDINLTHIESRPSRLNKDEYEFF"
[3] "TYLDKRTKPVLGSIIKSLRNDIGATVHELSRDKEKNTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQ"
[4] "FADIAYNYRHGQPIPRVEYTEEEKQTWGTVFRTLKALYKTHACYEHNHIFPLLEKYCGFREDNIPQLEDVSQFLQTCTGF"
[5] "RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFAQFSQEIG-LASLGAPDEYIE"
[6] "KLATIYWFTVEFGLCKEG-DSIKAYGAGLLSSFGELQYCLSD-KPKLLPLELEKTACQEYSVTEFQPLYYVAESFSDAKE"
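The record structure described above (a ">" description line followed by sequence lines) can be read with base R alone. The following is a minimal sketch, not how ggmsa reads files internally: `read_fasta_simple` is a hypothetical helper name, and the toy file it is tried on is made-up data.

```r
# Hypothetical helper: parse a FASTA file into a named character vector,
# one element per record, with wrapped sequence lines joined together.
read_fasta_simple <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]          # drop blank lines
  is_header <- startsWith(lines, ">")
  if (!any(is_header)) stop("no FASTA description lines found")

  # Each sequence line belongs to the most recent description line above it
  record <- cumsum(is_header)
  keep <- !is_header & record > 0
  groups <- factor(record[keep], levels = seq_len(sum(is_header)))

  seqs <- vapply(split(lines[keep], groups), paste, character(1), collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])  # strip the leading ">"
  seqs
}

# Try it on a small made-up file
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 toy", "ACGT", "AC-T", ">seq2 toy", "ACGA", "ACTT"), tmp)
toy <- read_fasta_simple(tmp)
nchar(toy)  # aligned sequences should all have the same length
```

Checking that every sequence has the same length is a quick way to confirm that a file holds an alignment rather than unaligned sequences.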
Visualization
1. Multiple sequence alignment of proteins
This plot shows a selected region of the protein alignment (here, columns 300 to 330), which makes conserved regions and variable sites easier to identify. Residues are colored with the Chemistry_AA color scheme for amino acids.
# Multiple sequence alignment of proteins
p <- ggmsa(
  protein_fasta,
  start = 300,
  end = 330,
  font = "DroidSansMono",
  color = "Chemistry_AA",
  char_width = 0.5,
  seq_name = TRUE,
  consensus_views = FALSE
)
p
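To make the idea of a column window and of variable sites concrete, here is a small base R sketch. It uses a made-up three-sequence alignment rather than the ggmsa sample file, and the `start` and `end` variables only mimic the role of the arguments of the same name in the call above.

```r
# Toy alignment (hypothetical data): equal-length rows, "-" marks a gap
aln <- c(seqA = "MKT-LLVAG", seqB = "MKTALLV-G", seqC = "MRT-LLVAG")
stopifnot(length(unique(nchar(aln))) == 1)

# One row per sequence, one column per alignment position
aln_matrix <- do.call(rbind, strsplit(aln, ""))

# Restrict to a window of columns
start <- 2
end <- 8
window <- aln_matrix[, start:end, drop = FALSE]

# A column is variable if it holds more than one distinct character
# (gaps are treated as a character in this sketch)
n_distinct <- apply(window, 2, function(col) length(unique(col)))
variable_sites <- (start:end)[n_distinct > 1]
variable_sites
```

Columns that do not appear in `variable_sites` are fully conserved within the window.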
2. Multiple sequence alignment and statistics
Adding geom_msaBar() draws a bar chart alongside the alignment that summarizes conservation and variation at each column of the selected region, which helps identify key sites and regions.
# Multiple sequence alignment and statistics
p <- ggmsa(
  protein_fasta,
  start = 300,
  end = 320,
  font = "DroidSansMono",
  color = "Chemistry_AA",
  char_width = 0.5,
  seq_name = TRUE,
  consensus_views = FALSE
) +
  geom_msaBar()
p
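As a rough illustration of what a per-column conservation statistic can look like, the sketch below scores each column by the share of sequences that carry its most common character. This is one simple measure chosen for illustration, computed on made-up data; it is an assumption and not necessarily the calculation geom_msaBar() performs.

```r
# Toy alignment (hypothetical data): equal-length rows, "-" marks a gap
aln <- c(seqA = "MKT-LLVAG", seqB = "MKTALLV-G", seqC = "MRT-LLVAG")
aln_matrix <- do.call(rbind, strsplit(aln, ""))

# Conservation per column: fraction of sequences sharing the most common
# character (gaps are counted like any other character here)
conservation <- apply(aln_matrix, 2, function(col) max(table(col)) / length(col))
round(conservation, 2)
```

A score of 1 marks a fully conserved column; lower scores mark columns where the sequences disagree.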
3. Multiple sequence alignment and Logo
Adding geom_seqlogo() displays a sequence logo together with the alignment: at each position, the bases or amino acids found across the sequences are sorted and stacked into a logo that summarizes the consensus.
# Multiple sequence alignment and Logo
p <- ggmsa(
  protein_fasta,
  start = 300,
  end = 330,
  font = "DroidSansMono",
  color = "Chemistry_AA",
  char_width = 0.5,
  seq_name = TRUE,
  consensus_views = FALSE
) +
  geom_seqlogo()
p
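The sorting behind a logo can be sketched in base R as a per-column residue profile. The example below uses made-up data and a hypothetical helper, `column_profile`; it ignores gaps and sorts residues by relative frequency, which is an assumption about how a logo is typically ordered rather than a description of geom_seqlogo() internals.

```r
# Toy alignment (hypothetical data): equal-length rows, "-" marks a gap
aln <- c(seqA = "MKT-LLVAG", seqB = "MKTALLV-G", seqC = "MRT-LLVAG")
aln_matrix <- do.call(rbind, strsplit(aln, ""))

# Hypothetical helper: relative residue frequencies in one column,
# gaps excluded, most frequent residue first
column_profile <- function(col) {
  residues <- col[col != "-"]
  if (length(residues) == 0) return(NULL)   # column made only of gaps
  sort(table(residues) / length(residues), decreasing = TRUE)
}

# One profile per alignment column
profiles <- lapply(seq_len(ncol(aln_matrix)), function(j) column_profile(aln_matrix[, j]))
profiles[[2]]  # column 2 holds K in two sequences and R in one
```

The first entry of each profile is the consensus residue at that position, so pasting the first names together gives a simple consensus sequence.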
