Multiple Sequence Alignment

Alignments

Alignment

class pytrimal.Alignment

A multiple sequence alignment.

__init__(names, sequences)

Create a new alignment with the given names and sequences.

Parameters:
  • names (Sequence of bytes) – The names of the sequences in the alignment.

  • sequences (Sequence of bytes or str) – The actual sequences in the alignment.

Examples

Create a new alignment with a list of sequences and a list of names:

>>> alignment = Alignment(
...     names=[b"Sp8", b"Sp10", b"Sp26"],
...     sequences=[
...         "-----GLGKVIV-YGIVLGTKSDQFSNWVVWLFPWNGLQIHMMGII",
...         "-------DPAVL-FVIMLGTIT-KFS--SEWFFAWLGLEINMMVII",
...         "AAAAAAAAALLTYLGLFLGTDYENFA--AAAANAWLGLEINMMAQI",
...     ]
... )

There should be as many sequences as there are names, otherwise a ValueError will be raised:

>>> Alignment(
...     names=[b"Sp8", b"Sp10", b"Sp26"],
...     sequences=["GLQIHMMGII", "GLEINMMVII"]
... )
Traceback (most recent call last):
...
ValueError: `Alignment` given 3 names but 2 sequences

Sequence characters will be checked, and an error will be raised if they are not one of the characters from a biological alphabet:

>>> Alignment(
...     names=[b"Sp8", b"Sp10"],
...     sequences=["GLQIHMMGII", "GLEINMM123"]
... )
Traceback (most recent call last):
...
ValueError: The sequence "Sp10" has an unknown (49) character

copy()

Create a copy of this alignment.

dump(file, format='fasta')

Dump the alignment to a file or a file-like object.

Parameters:
  • file (str, bytes, os.PathLike or file-like object) – The file to which to write the alignment. If a file-like object is given, it must be open in binary mode. Otherwise, file is treated as a path.

  • format (str) – The name of the alignment format to write. See below for a list of supported formats.

Raises:
  • ValueError – When format is not a recognized file format.

  • OSError – When the path given as file could not be opened.

Hint

The alignment can be written in one of the following formats:

clustal

The alignment format produced by the Clustal and Clustal Omega alignment programs.

fasta

The aligned FASTA format, which outputs all sequences in the alignment as FASTA records with gap characters (see Wikipedia:FASTA format).

html

An HTML report showing the alignment in pseudo-Clustal format with colored residues.

mega

The alignment format produced by the MEGA software for evolutionary analysis of alignments.

nexus

The NEXUS alignment format (see Wikipedia:Nexus file).

phylip or phylip40

The PHYLIP 4.0 alignment format.

phylip32

The PHYLIP 3.2 alignment format.

phylippaml

A variant of PHYLIP 4.0 compatible with the PAML tool for phylogenetic analysis.

nbrf or pir

The format of Protein Information Resource database files, provided by the National Biomedical Research Foundation.

Additionally, the fasta, nexus, phylippaml, phylip32, and phylip40 formats support an _m10 variant, which limits the sequence names to 10 characters.
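
To give a rough idea of what the aligned FASTA output and a name-limited variant look like, here is a minimal pure-Python sketch. The write_aligned_fasta helper and its max_name_length parameter are hypothetical and not part of pytrimal; the sketch writes each sequence on a single line, so the exact layout produced by dump may differ:

```python
import io

def write_aligned_fasta(file, names, sequences, max_name_length=None):
    """Write aligned sequences as FASTA records to a binary file-like object.

    Hypothetical helper for illustration only; it is not part of pytrimal.
    """
    for name, sequence in zip(names, sequences):
        if max_name_length is not None:
            # Mimic a name-limited variant by truncating the name.
            name = name[:max_name_length]
        file.write(b">" + name + b"\n")
        # Gap characters are written as-is, so all records keep the same length.
        file.write(sequence.encode("ascii") + b"\n")

buffer = io.BytesIO()  # a file-like object open in binary mode
write_aligned_fasta(
    buffer,
    names=[b"Sp8_long_sequence_name", b"Sp10"],
    sequences=["QFSNWV", "KFS--S"],
    max_name_length=10,
)
text = buffer.getvalue().decode("utf-8")
print(text)
```

The buffer is an io.BytesIO because the file-like object is expected to be open in binary mode, as stated for the file parameter.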

New in version 0.2.2.

dumps(format='fasta', encoding='utf-8')

Dump the alignment to a string in the provided format.

Parameters:
  • format (str) – The format of the alignment. See the dump method for a list of supported formats.

  • encoding (str) – The encoding to use to decode sequence names.

Raises:

ValueError – When format is not a recognized file format.

New in version 0.2.2.

from_biopython(alignment)

Create a new Alignment from an iterable of Biopython records.

Parameters:

alignment (iterable of SeqRecord) – An iterable of Biopython record objects to build the alignment from. Passing a Bio.Align.MultipleSeqAlignment object is also supported.

Returns:

Alignment – A new alignment object ready for trimming.

New in version 0.5.0.

from_pyhmmer(alignment)

Create a new Alignment from a pyhmmer.easel.TextMSA.

Parameters:

alignment (TextMSA) – A PyHMMER object storing a multiple sequence alignment in text format.

Returns:

Alignment – A new alignment object ready for trimming.

New in version 0.5.0.

load(file, format=None)

Load a multiple sequence alignment from a file.

Parameters:
  • file (str, bytes, os.PathLike or file-like object) – The file from which to read the alignment. If a file-like object is given, it must be open in binary mode and support random access with the seek method. Otherwise, file is treated as a path.

  • format (str, optional) – The file format the alignment is stored in. Must be given when loading from a file-like object; it is autodetected when reading from a path.

Returns:

Alignment – The deserialized alignment.
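
The requirements on file-like objects (binary mode, random access) can be checked up front. The sketch below uses only the standard library; is_binary_and_seekable is a hypothetical helper, not a pytrimal function, and it only approximates the checks a loader might perform:

```python
import io

def is_binary_and_seekable(file):
    """Return True if `file` looks like a binary, seekable file-like object.

    Hypothetical helper for illustration only; it is not part of pytrimal.
    """
    if isinstance(file, io.TextIOBase):
        return False  # text-mode objects yield str, not bytes
    seekable = getattr(file, "seekable", None)
    return hasattr(file, "read") and callable(seekable) and bool(seekable())

binary_buffer = io.BytesIO(b">Sp8\nQFSNWV\n>Sp10\nKFS--S\n")
text_buffer = io.StringIO(">Sp8\nQFSNWV\n>Sp10\nKFS--S\n")

print(is_binary_and_seekable(binary_buffer))
print(is_binary_and_seekable(text_buffer))
```

An object such as binary_buffer satisfies both requirements; remember that format must then be given explicitly, since autodetection only applies when reading from a path.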

Example

>>> msa = Alignment.load("example.001.AA.clw")
>>> msa.names
[b'Sp8', b'Sp10', b'Sp26', b'Sp6', b'Sp17', b'Sp33']

Changed in version 0.3.0: Added support for reading from a file-like object.

to_biopython()

Create a new MultipleSeqAlignment from this Alignment.

Returns:

MultipleSeqAlignment – A multiple sequence alignment object as implemented in Biopython.

Raises:

ImportError – When the Bio module cannot be imported.

New in version 0.5.0.

to_pyhmmer()

Create a new TextMSA from this Alignment.

Returns:

TextMSA – A PyHMMER multiple sequence alignment in text mode.

Raises:

ImportError – When the pyhmmer module cannot be imported.

New in version 0.5.0.

names

The names of the sequences in the alignment.

Type:

sequence of bytes

residues

The residues in the alignment.

Type:

AlignmentResidues

sequences

The sequences in the alignment.

Type:

AlignmentSequences

Trimmed Alignment

class pytrimal.TrimmedAlignment(Alignment)

A multiple sequence alignment that has been trimmed.

Internally, the trimming process produces a mask of sequences and a mask of residues. This class only exposes the filtered sequences and residues.

Example

Create a trimmed alignment using two lists to filter out some residues and sequences:

>>> trimmed = TrimmedAlignment(
...    names=[b"Sp8", b"Sp10", b"Sp26"],
...    sequences=["QFSNWV", "KFS--S", "NFA--A"],
...    sequences_mask=[True, True, False],
...    residues_mask=[True, True, True, False, False, True],
... )

The names and sequences properties will only contain the retained sequences and residues:

>>> list(trimmed.names)
[b'Sp8', b'Sp10']
>>> list(trimmed.sequences)
['QFSV', 'KFSS']

Use the original_alignment method to build the original unfiltered alignment containing all sequences and residues:

>>> ali = trimmed.original_alignment()
>>> list(ali.names)
[b'Sp8', b'Sp10', b'Sp26']
>>> list(ali.sequences)
['QFSNWV', 'KFS--S', 'NFA--A']

__init__(names, sequences, sequences_mask=None, residues_mask=None)

Create a new alignment with the given names, sequences and masks.

Parameters:
  • names (Sequence of bytes) – The names of the sequences in the alignment.

  • sequences (Sequence of str) – The actual sequences in the alignment.

  • sequences_mask (Sequence of bool) – A mask for which sequences to keep in the trimmed alignment. If given, must be as long as the sequences and names lists.

  • residues_mask (Sequence of bool) – A mask for which residues to keep in the trimmed alignment. If given, must be as long as every element in the sequences argument.
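
The effect of the two masks can be pictured in plain Python. This is only a sketch of the filtering semantics, not how pytrimal implements it; apply_masks is a hypothetical helper, and treating an omitted mask as "keep everything" is an assumption:

```python
from itertools import compress

def apply_masks(names, sequences, sequences_mask=None, residues_mask=None):
    """Return the names and sequences retained by the two masks.

    Hypothetical helper for illustration only; it is not part of pytrimal.
    """
    # Assumption: a missing mask keeps every sequence or residue.
    if sequences_mask is None:
        sequences_mask = [True] * len(names)
    if residues_mask is None:
        residues_mask = [True] * len(sequences[0]) if sequences else []
    if len(sequences_mask) != len(sequences):
        raise ValueError("sequences_mask must be as long as sequences")
    if any(len(seq) != len(residues_mask) for seq in sequences):
        raise ValueError("residues_mask must be as long as every sequence")
    kept_names = list(compress(names, sequences_mask))
    kept_sequences = [
        "".join(compress(seq, residues_mask))  # drop masked-out columns
        for seq in compress(sequences, sequences_mask)  # drop masked-out rows
    ]
    return kept_names, kept_sequences

kept_names, kept_sequences = apply_masks(
    names=[b"Sp8", b"Sp10", b"Sp26"],
    sequences=["QFSNWV", "KFS--S", "NFA--A"],
    sequences_mask=[True, True, False],
    residues_mask=[True, True, True, False, False, True],
)
print(kept_names)
print(kept_sequences)
```

With the same inputs as the TrimmedAlignment example, the sketch retains the names Sp8 and Sp10 and the sequences QFSV and KFSS.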

copy()

Create a copy of this trimmed alignment.

load()

Load a multiple sequence alignment from a file.

Parameters:
  • file (str, bytes, os.PathLike or file-like object) – The file from which to read the alignment. If a file-like object is given, it must be open in binary mode and support random access with the seek method. Otherwise, file is treated as a path.

  • format (str, optional) – The file format the alignment is stored in. Must be given when loading from a file-like object; it is autodetected when reading from a path.

Returns:

Alignment – The deserialized alignment.

Example

>>> msa = Alignment.load("example.001.AA.clw")
>>> msa.names
[b'Sp8', b'Sp10', b'Sp26', b'Sp6', b'Sp17', b'Sp33']

Changed in version 0.3.0: Added support for reading from a file-like object.

original_alignment()

Rebuild the original alignment from which this object was obtained.

Returns:

Alignment – The untrimmed alignment that produced this trimmed alignment.

terminal_only()

Get a trimmed alignment where only the terminal residues are removed.

Returns:

TrimmedAlignment – The alignment where only terminal residues have been trimmed.

residues_mask

Which residues are kept in the alignment.

Type:

sequence of bool

sequences_mask

Which sequences are kept in the alignment.

Type:

sequence of bool

Attributes

AlignmentSequences

class pytrimal.AlignmentSequences

A read-only view over the sequences of an alignment.

Objects from this class are created in the sequences property of Alignment objects. Use it to access the string data of individual rows from the alignment:

>>> msa = Alignment.load("example.001.AA.clw")
>>> len(msa.sequences)
6
>>> msa.sequences[0]
'-----GLGKVIV-YGIVLGTKSDQFSNWVVWLFPWNGLQIHMMGII'
>>> sum(seq.count('-') for seq in msa.sequences)
43

A slice over a subset of the sequences can also be obtained without copying the internal data, which makes it possible to create a new Alignment with only some of the sequences from the original one:

>>> msa2 = Alignment(msa.names[:4:2], msa.sequences[:4:2])
>>> len(msa2.sequences)
2
>>> msa2.sequences[1] == msa.sequences[2]
True
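
The extended slice [:4:2] used above selects rows 0 and 2, which is why the second sequence of msa2 equals the third sequence of msa. A plain-list sketch of the same index arithmetic follows; note that slicing a Python list copies its items, unlike the view described here:

```python
# Names taken from the example alignment shown for Alignment.load.
names = [b"Sp8", b"Sp10", b"Sp26", b"Sp6", b"Sp17", b"Sp33"]

# Slicing a range shows which row indices the slice [:4:2] selects.
selected_indices = list(range(len(names))[:4:2])
selected_names = names[:4:2]

print(selected_indices)  # [0, 2]
print(selected_names)    # [b'Sp8', b'Sp26']
```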

New in version 0.4.0: Support for zero-copy slicing.

AlignmentResidues

class pytrimal.AlignmentResidues

A read-only view over the residues of an alignment.

Objects from this class are created in the residues property of Alignment objects. Use it to access the string data of individual columns from the alignment:

>>> msa = Alignment.load("example.001.AA.clw")
>>> len(msa.residues)
46
>>> msa.residues[0]
'--A---'
>>> msa.residues[-1]
'IIIIFL'
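
The relationship between rows (sequences) and columns (residues) can be sketched with plain strings. This only illustrates the transposition; it builds new strings and is not how the view is implemented in pytrimal:

```python
# Rows of a small alignment, one string per sequence.
sequences = ["QFSNWV", "KFS--S", "NFA--A"]

# Each column collects the character found at the same position in every row.
columns = ["".join(column) for column in zip(*sequences)]

print(len(columns))  # 6
print(columns[0])    # QKN
print(columns[-1])   # VSA
```

Each column string has one character per sequence, in the same order as the sequences of the alignment.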

New in version 0.4.0: Support for zero-copy slicing.