Trimming an alignment to build an HMM

In this example, we will use pytrimal to trim an alignment, and create a profile HMM in HMMER format using pyhmmer. The data is an alignment made with MUSCLE of all the halorhodopsins found in UniProt.

import pytrimal
import pyhmmer

Creating the alignment

To demonstrate how to create an Alignment with the Python API rather than loading it from a file, we will use a pyhmmer.easel.MSAFile to load the source alignment.

import pathlib
with pyhmmer.easel.MSAFile(pathlib.Path("data").joinpath("halorhodopsin.afa")) as msa_file:
    msa = = b"halorhodopsin"

Now, create the alignment by passing it a list of sequence names and text. Sequence names must be bytes because they may contain special characters; it is up to you to handle the encoding and decoding. However, sequences can be given as strings, because all characters are checked, and biological characters are always in the ASCII range. Luckily, TextMSA objects have handy attributes to get the names and the sequences, so conversion is straightforward

alignment = pytrimal.Alignment(names=msa.names, sequences=msa.alignment)

Using Matplotlib, we can visualize the number of gaps at a given position in the alignment:

import matplotlib.pyplot as plt
plt.figure(figsize=(16, 4)), len(alignment.residues)+1), [col.count('-') for col in alignment.residues])
plt.xlabel("Position in sequence")
plt.ylabel("Number of gaps")
plt.ylim(0, len(alignment.sequences))
/home/docs/checkouts/ FutureWarning: --rc={'figure.dpi': 96} for dict-traits is deprecated in traitlets 5.0. You can pass --rc <key=value> ... multiple times to add items to a dict.

Trimming in automatic mode

We can see that the source alignment is heavily gapped, probably because there are one or two heavily diverged sequences in the input. Let’s use fully automated trimming, and see how well this removes the gaps:

trimmer = pytrimal.AutomaticTrimmer(method="automated1")
trimmed = trimmer.trim(alignment)

Now, let’s visualize how well the trimming worked. For this, we just plot the background of the graph in grey for regions that have been trimmed out:

import matplotlib.pyplot as plt
plt.figure(figsize=(16, 4))
X = range(1, len(alignment.residues)+1)
Y = [col.count('-') if mask else 0 for col, mask in zip(alignment.residues, trimmed.residues_mask)]
mask = [len(alignment.sequences) if not mask else 0 for mask in trimmed.residues_mask], mask, color="gray", alpha=0.5, width=1), Y)
plt.xlabel("Position in sequence")
plt.ylabel("Number of gaps")
plt.ylim(0, len(alignment.sequences))

We can check the residues attribute to see how many columns remain in the alignment:


Building a HMM from the trimmed alignment

We will now build a HMM with pyHMMER. For more information, see the example in the pyHMMER documentation. First, convert the trimmed alignment back to a pyhmmer.easel.TextMSA, which is easily done with the Alignment.to_pyhmmer method:

trimmed_msa = trimmed.to_pyhmmer() = b"halorhodopsin-trimmed"

To build the HMM, we need a Builder and a Background model from pyhmmer:

alphabet = pyhmmer.easel.Alphabet.amino()
builder = pyhmmer.plan7.Builder(alphabet)
background = pyhmmer.plan7.Background(alphabet)

Now we can build the HMMs from the multiple sequence alignment using the Builder.build_msa method. We build two HMMs, one for the trimmed alignment, and one for the full alignment, to compare them later.

hmm, _, _ = builder.build_msa(msa.digitize(alphabet), background)
hmm_trimmed, _, _ = builder.build_msa(trimmed_msa.digitize(alphabet), background)

Compare the HMMs

Trimming the input alignment should have removed the least informative columns. To see the effect on the HMM, let’s build a table of statistics comparing the HMM built on the full alignment, and the one built on the trimmed alignment:

import rich.table

table = rich.table.Table()
table.add_column("HMM (Full)")
table.add_column("HMM (Trimmed)")

table.add_row("Number of nodes", repr(hmm.M), repr(hmm_trimmed.M))
table.add_row("Effective sequence count", repr(hmm.nseq_effective), repr(hmm_trimmed.nseq_effective))
table.add_row("Mean match information", repr(hmm.mean_match_information(background)), repr(hmm_trimmed.mean_match_information(background)))
table.add_row("Mean match relative entropy", repr(hmm.mean_match_relative_entropy(background)), repr(hmm_trimmed.mean_match_relative_entropy(background)))

┃ Statistic                    HMM (Full)          HMM (Trimmed)      ┃
│ Number of nodes             │ 394                │ 189                │
│ Effective sequence count    │ 2.3590087890625    │ 1.6510009765625    │
│ Mean match information      │ 0.6031365748589415 │ 0.609191719817106  │
│ Mean match relative entropy │ 0.5895090606820008 │ 0.5900501518574341 │

Our initial alignment was over 1000 residues long, but even the full HMM has a much smaller number of nodes, which means it defined more than 600 gap-containing alignment columns to be insertions relative to consensus. On the contrary the trimmed HMM used all the columns from the trimmed alignment as match states. The effective sequence count still decreased in the trimmed HMM, but the mean match information and relative entropy was not affected.

To visualize the influence of the gaps, we can plot the different transition probabilities for every node of the HMM:

plt.figure(figsize=(16, 8))

labels = [
    r"$M_n \to M_{n+1}$",
    r"$M_n \to I_{n+1}$",
    r"$M_n \to D_{n+1}$",
    r"$I_n \to M_{n+1}$",
    r"$I_n \to I_{n+1}$",
    r"$D_n \to M_{n+1}$",
    r"$D_n \to D_{n+1}$",

plt.subplot(2, 1, 1)
plt.ylabel("Transition probability")
plt.xlim(0, max(hmm.M, hmm_trimmed.M))
for i in range(7):
    plt.plot([ row[i] for row in hmm.transition_probabilities ])

plt.subplot(2, 1, 2)
plt.xlim(0, max(hmm.M, hmm_trimmed.M))
plt.ylabel("Transition probability")
for i in range(7):
    plt.plot([ row[i] for row in hmm_trimmed.transition_probabilities ], label=labels[i])

plt.xlabel("HMM node")

It’s immediately obvious that because of the gaps in the original alignment, the full HMM tends to expect much more insertions at certain positions. It also exhbibits several positions (e.g. around 40 or 175) where the full HMM model in a match state has higher probability to go both in an insertion or a deletion state, which is incoherent. In comparison, the transition probabilities for the trimmed HMM are smoother.