{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "64109ae3-7e25-42d8-be70-ad3f4d2cf426",
   "metadata": {},
   "source": [
    "# Trimming an alignment to build an HMM\n",
    "\n",
    "In this example, we will use `pytrimal` to trim an alignment, and create a profile HMM in HMMER format using `pyhmmer`. The data is an alignment made with [MUSCLE](https://www.ebi.ac.uk/Tools/msa/muscle/) of all the [halorhodopsins](https://en.wikipedia.org/wiki/Halorhodopsin) found in [UniProt](https://www.uniprot.org/uniprot/?query=halorhodopsin&sort=score)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1bcd745-2fe2-4e49-a911-2fee31570696",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pytrimal\n",
    "pytrimal.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5ae5d7c-8122-4aeb-a45b-ef64090f0fac",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyhmmer\n",
    "pyhmmer.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6d101b5-c125-4e31-ba73-affa7873dbef",
   "metadata": {},
   "source": [
    "## Creating the alignment\n",
    "\n",
    "To demonstrate how to create an `Alignment` with the Python API rather than loading it from a file, we will use a `pyhmmer.easel.MSAFile` to load the source alignment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "474eac4d-0b25-48e8-885e-943afd489ae0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pathlib\n",
    "with pyhmmer.easel.MSAFile(pathlib.Path(\"data\").joinpath(\"halorhodopsin.afa\")) as msa_file:\n",
    "    msa = msa_file.read()\n",
    "    msa.name = b\"halorhodopsin\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4ff2610-7ee1-44e4-9ac9-6b4c431b415f",
   "metadata": {},
   "source": [
    "Now, create the alignment by passing it a list of sequence names and text. Sequence names must be `bytes` because they may contain special characters; it is up to you to handle the encoding and decoding. However, sequences can be given as strings, because all characters are checked, and biological characters are always in the ASCII range. Luckily, `TextMSA` objects have handy attributes to get the names and the sequences, so conversion is straightforward"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69371140-0e6b-4845-9db1-f6e54342688a",
   "metadata": {},
   "outputs": [],
   "source": [
    "alignment = pytrimal.Alignment(names=msa.names, sequences=msa.alignment)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "336aa1b6-2d4e-4278-833e-3818df072aba",
   "metadata": {},
   "source": [
    "Using [Matplotlib](https://matplotlib.org/), we can visualize the number of gaps at a given position in the alignment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6deb7195-4825-45eb-9391-3c9e6f5d62ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.figure(figsize=(16, 4))\n",
    "plt.bar(range(1, len(alignment.residues)+1), [col.count('-') for col in alignment.residues])\n",
    "plt.xlabel(\"Position in sequence\")\n",
    "plt.ylabel(\"Number of gaps\")\n",
    "plt.ylim(0, len(alignment.sequences))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f76f722-d769-41ac-8c5f-43cf2e64ce22",
   "metadata": {},
   "source": [
    "## Trimming in automatic mode"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7abf6609-af4b-4175-a378-20a7417b45a9",
   "metadata": {},
   "source": [
    "We can see that the source alignment is heavily gapped, probably because there are one or two heavily diverged sequences in the input. Let's use fully automated trimming, and see how well this removes the gaps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25a0c7a1-09fd-485c-a6a6-f717885f9279",
   "metadata": {},
   "outputs": [],
   "source": [
    "trimmer = pytrimal.AutomaticTrimmer(method=\"automated1\")\n",
    "trimmed = trimmer.trim(alignment)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11bed682-a3ae-46e1-a80a-eba4977bfcf0",
   "metadata": {},
   "source": [
    "Now, let's visualize how well the trimming worked. For this, we just plot the background of the graph in grey for regions that have been trimmed out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72f462f6-e800-405e-9f21-e51191799946",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.figure(figsize=(16, 4))\n",
    "X = range(1, len(alignment.residues)+1)\n",
    "Y = [col.count('-') if mask else 0 for col, mask in zip(alignment.residues, trimmed.residues_mask)]\n",
    "mask = [len(alignment.sequences) if not mask else 0 for mask in trimmed.residues_mask]\n",
    "plt.bar(X, mask, color=\"gray\", alpha=0.5, width=1)\n",
    "plt.bar(X, Y)\n",
    "plt.xlabel(\"Position in sequence\")\n",
    "plt.ylabel(\"Number of gaps\")\n",
    "plt.ylim(0, len(alignment.sequences))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88698d6c-5fe8-4c61-9841-4641c0029da3",
   "metadata": {},
   "source": [
    "We can check the `residues` attribute to see how many columns remain in the alignment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc7fdd58-4b4c-494f-9286-0d97bca748f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "len(trimmed.residues)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ddcc41cc-e6a5-4379-9c8c-35fbf3bd3bab",
   "metadata": {},
   "source": [
    "## Building a HMM from the trimmed alignment\n",
    "\n",
    "We will now build a HMM with [pyHMMER](https://pyhmmer.readthedocs.io). For more information, see the [example in the pyHMMER documentation](https://pyhmmer.readthedocs.io/en/stable/examples/msa_to_hmm.html). First, convert the trimmed alignment back to a `pyhmmer.easel.TextMSA`, which is easily done with the `Alignment.to_pyhmmer` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e2ac466-c6db-469e-97fd-1e7db6d205d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "trimmed_msa = trimmed.to_pyhmmer()\n",
    "trimmed_msa.name = b\"halorhodopsin-trimmed\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "679f1e14-864c-442b-9b46-3afd94f209f9",
   "metadata": {},
   "source": [
    "To build the HMM, we need a `Builder` and a `Background` model from `pyhmmer`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2b85e8a-e363-4331-b389-fe7b8a3e1b6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "alphabet = pyhmmer.easel.Alphabet.amino()\n",
    "builder = pyhmmer.plan7.Builder(alphabet)\n",
    "background = pyhmmer.plan7.Background(alphabet)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c47471b-b30e-4995-9dd3-5760b0875ff9",
   "metadata": {},
   "source": [
    "Now we can build the HMMs from the multiple sequence alignment using the `Builder.build_msa` method. We build two HMMs, one for the trimmed alignment, and one for the full alignment, to compare them later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86187ea5-d015-46d6-91dd-0ce3826d5124",
   "metadata": {},
   "outputs": [],
   "source": [
    "hmm, _, _ = builder.build_msa(msa.digitize(alphabet), background)\n",
    "hmm_trimmed, _, _ = builder.build_msa(trimmed_msa.digitize(alphabet), background)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "363cfa5b-629b-4db7-ab48-430b1837cfc1",
   "metadata": {},
   "source": [
    "## Compare the HMMs\n",
    "\n",
    "Trimming the input alignment should have removed the least informative columns. To see the effect on the HMM, let's build a table of statistics comparing the HMM built on the full alignment, and the one built on the trimmed alignment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70c3244e-b212-456a-9a32-dbcc06bf140e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import rich.table\n",
    "\n",
    "table = rich.table.Table()\n",
    "table.add_column(\"Statistic\")\n",
    "table.add_column(\"HMM (Full)\")\n",
    "table.add_column(\"HMM (Trimmed)\")\n",
    "\n",
    "table.add_row(\"Number of nodes\", repr(hmm.M), repr(hmm_trimmed.M))\n",
    "table.add_row(\"Effective sequence count\", repr(hmm.nseq_effective), repr(hmm_trimmed.nseq_effective))\n",
    "table.add_row(\"Mean match information\", repr(hmm.mean_match_information(background)), repr(hmm_trimmed.mean_match_information(background)))\n",
    "table.add_row(\"Mean match relative entropy\", repr(hmm.mean_match_relative_entropy(background)), repr(hmm_trimmed.mean_match_relative_entropy(background)))\n",
    "\n",
    "rich.print(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4c9e3a2-def1-48ad-a9c2-eb6ede1cc173",
   "metadata": {},
   "source": [
    "Our initial alignment was over 1000 residues long, but even the full HMM has a much smaller number of nodes, which means it defined more than 600 gap-containing alignment columns to be insertions relative to consensus. On the contrary the trimmed HMM used all the columns from the trimmed alignment as match states. The effective sequence count still decreased in the trimmed HMM, but the mean match information and relative entropy was not affected.\n",
    "\n",
    "To visualize the influence of the gaps, we can plot the different transition probabilities for every node of the HMM:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "279366b1-951b-454d-ad8c-e87e33d879ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(16, 8))\n",
    "\n",
    "labels = [\n",
    "    r\"$M_n \\to M_{n+1}$\", \n",
    "    r\"$M_n \\to I_{n+1}$\", \n",
    "    r\"$M_n \\to D_{n+1}$\", \n",
    "    r\"$I_n \\to M_{n+1}$\",\n",
    "    r\"$I_n \\to I_{n+1}$\",\n",
    "    r\"$D_n \\to M_{n+1}$\",\n",
    "    r\"$D_n \\to D_{n+1}$\",\n",
    "]\n",
    "\n",
    "plt.subplot(2, 1, 1)\n",
    "plt.ylabel(\"Transition probability\")\n",
    "plt.xlim(0, max(hmm.M, hmm_trimmed.M))\n",
    "for i in range(7):\n",
    "    plt.plot([ row[i] for row in hmm.transition_probabilities ])\n",
    "\n",
    "plt.subplot(2, 1, 2)\n",
    "plt.xlim(0, max(hmm.M, hmm_trimmed.M))\n",
    "plt.ylabel(\"Transition probability\")\n",
    "for i in range(7):\n",
    "    plt.plot([ row[i] for row in hmm_trimmed.transition_probabilities ], label=labels[i])\n",
    "    \n",
    "plt.xlabel(\"HMM node\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "988f4928-8864-4f47-be8d-2ad420dbd947",
   "metadata": {},
   "source": [
    "It's immediately obvious that because of the gaps in the original alignment, the full HMM tends to expect much more insertions at certain positions. It also exhbibits several positions (e.g. around 40 or 175) where the full HMM model in a match state has higher probability to go both in an insertion or a deletion state, which is incoherent. In comparison, the transition probabilities for the trimmed HMM are smoother."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}