Trimming an alignment to build an HMM¶
In this example, we will use pytrimal to trim an alignment, and create a profile HMM in HMMER format using pyhmmer. The data is an alignment, made with MUSCLE, of all the halorhodopsins found in UniProt.
[1]:
import pytrimal
pytrimal.__version__
[1]:
'0.8.0-alpha1'
[2]:
import pyhmmer
pyhmmer.__version__
[2]:
'0.10.12'
Creating the alignment¶
To demonstrate how to create an Alignment with the Python API rather than loading it from a file, we will use a pyhmmer.easel.MSAFile to load the source alignment.
[3]:
import pathlib
with pyhmmer.easel.MSAFile(pathlib.Path("data").joinpath("halorhodopsin.afa")) as msa_file:
    msa = msa_file.read()
msa.name = b"halorhodopsin"
Now, create the alignment by passing the Alignment constructor a list of sequence names and a list of sequence texts. Sequence names must be bytes, because they may contain special characters; it is up to you to handle the encoding and decoding. Sequences, however, can be given as strings, because all characters are checked and biological characters are always in the ASCII range. Luckily, TextMSA objects have handy attributes to get the names and the sequences, so the conversion is straightforward:
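As a minimal illustration of the expected input shapes, here are some made-up toy names and sequences (not taken from the halorhodopsin data): names are bytes, sequences are equal-length ASCII strings.

```python
# Toy data illustrating the input shapes: names as `bytes`,
# sequences as equal-length ASCII strings (it is an alignment).
names = [name.encode("utf-8") for name in ["seq1", "seq2", "seq3"]]
sequences = [
    "MKT-LLI",
    "MKTALLI",
    "M-TALL-",
]

# Every row must have the same length, and residues are plain ASCII.
assert len({len(seq) for seq in sequences}) == 1
assert all(seq.isascii() for seq in sequences)
```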
[4]:
alignment = pytrimal.Alignment(names=msa.names, sequences=msa.alignment)
Using Matplotlib, we can visualize the number of gaps at a given position in the alignment:
[5]:
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 4))
plt.bar(range(1, len(alignment.residues)+1), [col.count('-') for col in alignment.residues])
plt.xlabel("Position in sequence")
plt.ylabel("Number of gaps")
plt.ylim(0, len(alignment.sequences))
plt.show()
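Under the hood, the gap counts in the plot above are simple per-column tallies; a pure-Python sketch of the same computation, on toy sequences rather than the halorhodopsin data:

```python
# Transpose toy aligned sequences into columns and count gaps per column.
sequences = ["MKT-LLI", "MKTALLI", "M-TALL-"]
columns = ["".join(col) for col in zip(*sequences)]
gap_counts = [col.count("-") for col in columns]
# gap_counts → [0, 1, 0, 1, 0, 0, 1]
```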
Trimming in automatic mode¶
We can see that the source alignment is heavily gapped, probably because there are one or two heavily diverged sequences in the input. Let’s use fully automated trimming, and see how well this removes the gaps:
[6]:
trimmer = pytrimal.AutomaticTrimmer(method="automated1")
trimmed = trimmer.trim(alignment)
Now, let’s visualize how well the trimming worked. For this, we just plot the background of the graph in grey for regions that have been trimmed out:
[7]:
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 4))
X = range(1, len(alignment.residues)+1)
Y = [col.count('-') if mask else 0 for col, mask in zip(alignment.residues, trimmed.residues_mask)]
mask = [len(alignment.sequences) if not mask else 0 for mask in trimmed.residues_mask]
plt.bar(X, mask, color="gray", alpha=0.5, width=1)
plt.bar(X, Y)
plt.xlabel("Position in sequence")
plt.ylabel("Number of gaps")
plt.ylim(0, len(alignment.sequences))
plt.show()
We can check the residues attribute to see how many columns remain in the alignment:
[8]:
len(trimmed.residues)
[8]:
194
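The trimmed alignment also exposes a residues_mask (used for the grey overlay plot above) marking which columns were kept. On toy data, applying such a boolean column mask looks like this (the mask values here are made up for illustration):

```python
# Toy alignment and a hypothetical column mask: True = kept, False = trimmed.
sequences = ["MKT-LLI", "MKTALLI", "M-TALL-"]
mask = [True, True, True, False, True, True, False]

trimmed = [
    "".join(res for res, keep in zip(seq, mask) if keep)
    for seq in sequences
]
# trimmed → ["MKTLL", "MKTLL", "M-TLL"]
assert all(len(row) == sum(mask) for row in trimmed)
```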
Building an HMM from the trimmed alignment¶
We will now build an HMM with pyHMMER. For more information, see the example in the pyHMMER documentation. First, convert the trimmed alignment back to a pyhmmer.easel.TextMSA, which is easily done with the Alignment.to_pyhmmer method:
[9]:
trimmed_msa = trimmed.to_pyhmmer()
trimmed_msa.name = b"halorhodopsin-trimmed"
To build the HMM, we need a Builder and a Background model from pyhmmer:
[10]:
alphabet = pyhmmer.easel.Alphabet.amino()
builder = pyhmmer.plan7.Builder(alphabet)
background = pyhmmer.plan7.Background(alphabet)
Now we can build the HMMs from the multiple sequence alignments using the Builder.build_msa method. We build two HMMs, one for the trimmed alignment and one for the full alignment, so we can compare them later.
[11]:
hmm, _, _ = builder.build_msa(msa.digitize(alphabet), background)
hmm_trimmed, _, _ = builder.build_msa(trimmed_msa.digitize(alphabet), background)
Compare the HMMs¶
Trimming the input alignment should have removed the least informative columns. To see the effect on the HMM, let’s build a table of statistics comparing the HMM built on the full alignment, and the one built on the trimmed alignment:
[12]:
import rich.table
table = rich.table.Table()
table.add_column("Statistic")
table.add_column("HMM (Full)")
table.add_column("HMM (Trimmed)")
table.add_row("Number of nodes", repr(hmm.M), repr(hmm_trimmed.M))
table.add_row("Effective sequence count", repr(hmm.nseq_effective), repr(hmm_trimmed.nseq_effective))
table.add_row("Mean match information", repr(hmm.mean_match_information(background)), repr(hmm_trimmed.mean_match_information(background)))
table.add_row("Mean match relative entropy", repr(hmm.mean_match_relative_entropy(background)), repr(hmm_trimmed.mean_match_relative_entropy(background)))
rich.print(table)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Statistic                   ┃ HMM (Full)         ┃ HMM (Trimmed)      ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ Number of nodes             │ 394                │ 194                │
│ Effective sequence count    │ 2.3590087890625    │ 1.6571044921875    │
│ Mean match information      │ 0.6031365748589415 │ 0.6095076793247891 │
│ Mean match relative entropy │ 0.5895090606820008 │ 0.5899250015770037 │
└─────────────────────────────┴────────────────────┴────────────────────┘
Our initial alignment was over 1,000 residues long, but even the full HMM has far fewer nodes, which means the builder assigned more than 600 gap-heavy alignment columns to insertions relative to the consensus. The trimmed HMM, in contrast, used every column of the trimmed alignment as a match state. The effective sequence count still decreased in the trimmed HMM, but the mean match information and relative entropy were barely affected.
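The relative entropy reported above is, for each match state, the Kullback-Leibler divergence (in bits) between its emission distribution and the background distribution, averaged over all nodes. A minimal sketch of the per-state computation, on a made-up two-letter alphabet rather than the amino-acid alphabet:

```python
from math import log2

def relative_entropy(p, q):
    # Kullback-Leibler divergence in bits between an emission
    # distribution p and a background distribution q.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A fairly informative toy match state against a uniform background.
p = [0.9, 0.1]
q = [0.5, 0.5]
print(round(relative_entropy(p, q), 3))  # → 0.531
```

A match state identical to the background would score 0 bits; the more a state's emissions diverge from the background, the higher its contribution to the mean.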
To visualize the influence of the gaps, we can plot the different transition probabilities for every node of the HMM:
[13]:
plt.figure(figsize=(16, 8))
labels = [
r"$M_n \to M_{n+1}$",
r"$M_n \to I_n$",
r"$M_n \to D_{n+1}$",
r"$I_n \to M_{n+1}$",
r"$I_n \to I_n$",
r"$D_n \to M_{n+1}$",
r"$D_n \to D_{n+1}$",
]
plt.subplot(2, 1, 1)
plt.ylabel("Transition probability")
plt.xlim(0, max(hmm.M, hmm_trimmed.M))
for i in range(7):
plt.plot([ row[i] for row in hmm.transition_probabilities ])
plt.subplot(2, 1, 2)
plt.xlim(0, max(hmm.M, hmm_trimmed.M))
plt.ylabel("Transition probability")
for i in range(7):
plt.plot([ row[i] for row in hmm_trimmed.transition_probabilities ], label=labels[i])
plt.xlabel("HMM node")
plt.legend()
plt.show()
It is immediately obvious that, because of the gaps in the original alignment, the full HMM expects many more insertions at certain positions. It also exhibits several positions (e.g. around nodes 40 and 175) where a match state has a high probability of transitioning into either an insertion or a deletion state, which is inconsistent. In comparison, the transition probabilities of the trimmed HMM are much smoother.