Multiple Sequence Alignment#
- class pytrimal.Alignment#
A multiple sequence alignment.
- __init__(names, sequences)#
Create a new alignment with the given names and sequences.
- Parameters:
names (Sequence of bytes) – The names of the sequences in the alignment.
sequences (Sequence of bytes or str) – The actual sequences in the alignment.
sequence_type (str or None) – The type of sequences stored in the alignment, one of protein, dna or rna. If None is given, use the trimAl strategy to auto-detect the type. This is used in some trimmer statistics to correctly identify indetermination symbols (X for protein sequences, N for DNA and RNA sequences).
Examples
Create a new alignment with a list of sequences and a list of names:
>>> alignment = Alignment(
...     names=[b"Sp8", b"Sp10", b"Sp26"],
...     sequences=[
...         "-----GLGKVIV-YGIVLGTKSDQFSNWVVWLFPWNGLQIHMMGII",
...         "-------DPAVL-FVIMLGTIT-KFS--SEWFFAWLGLEINMMVII",
...         "AAAAAAAAALLTYLGLFLGTDYENFA--AAAANAWLGLEINMMAQI",
...     ]
... )
There should be as many sequences as there are names, otherwise a ValueError will be raised:
>>> Alignment(
...     names=[b"Sp8", b"Sp10", b"Sp26"],
...     sequences=["GLQIHMMGII", "GLEINMMVII"]
... )
Traceback (most recent call last):
  ...
ValueError: `Alignment` given 3 names but 2 sequences
Sequence characters will be checked, and an error will be raised if they are not one of the characters from a biological alphabet:
>>> Alignment(
...     names=[b"Sp8", b"Sp10"],
...     sequences=["GLQIHMMGII", "GLEINMM123"]
... )
Traceback (most recent call last):
  ...
ValueError: The sequence "Sp10" has an unknown (49) character
Added in version 0.9.0: The sequence_type argument.
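The number in the last error message appears to be the code point of the offending character: 49 is ord("1"), the first non-alphabet character in "GLEINMM123". The following pure-Python sketch reproduces that kind of check. The accepted alphabet (ASCII letters plus a few gap and wildcard symbols) and the helper name first_unknown_character are assumptions made for illustration; they are not the set or the code that pytrimal or trimAl actually use.

```python
import string

# Assumed alphabet, for illustration only: ASCII letters plus common gap and
# wildcard symbols. The set accepted by pytrimal/trimAl may differ.
ALLOWED = set(string.ascii_letters) | set("-.*?")


def first_unknown_character(sequence):
    """Return the code point of the first character outside ALLOWED, or None."""
    for char in sequence:
        if char not in ALLOWED:
            return ord(char)
    return None


print(first_unknown_character("GLQIHMMGII"))  # None
print(first_unknown_character("GLEINMM123"))  # 49
```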
- copy()#
Create a copy of this alignment.
- dump()#
Dump the alignment to a file or a file-like object.
- Parameters:
file (str, bytes, os.PathLike or file-like object) – The file to which to write the alignment. If a file-like object is given, it must be open in binary mode. Otherwise, file is treated as a path.
format (str) – The name of the alignment format to write. See below for a list of supported formats.
- Raises:
ValueError – When format is not a recognized file format.
OSError – When the path given as file could not be opened.
Hint
The alignment can be written in one of the following formats:
clustal: The alignment format produced by the Clustal and Clustal Omega alignment software.
fasta: The aligned FASTA format, which outputs all sequences in the alignment as FASTA records with gap characters (see FASTA format).
html: An HTML report showing the alignment in pseudo-Clustal format with colored residues.
mega: The alignment format produced by the MEGA software for evolutionary analysis of alignments.
nexus: The NEXUS alignment format (see Nexus file).
phylip (or phylip40): The PHYLIP 4.0 alignment format.
phylip32: The PHYLIP 3.2 alignment format.
phylippaml: A variant of PHYLIP 4.0 compatible with the PAML tool for phylogenetic analysis.
nbrf or pir: The format of Protein Information Resource database files, provided by the National Biomedical Research Foundation.
Additionally, the fasta, nexus, phylippaml, phylip32, and phylip40 formats support an _m10 variant, which limits the sequence names to 10 characters.
Added in version 0.2.2.
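As a rough illustration of the naming scheme described in the hint, the sketch below builds the set of format names it lists, including the _m10 variants, and reports which sequence names exceed the 10-character limit. The names are copied from the list above rather than queried from pytrimal, so the library may accept more or fewer names; documented_format_names and names_over_limit are hypothetical helpers, not part of the pytrimal API.

```python
# Format names copied from the documentation above (not read from pytrimal).
BASE_FORMATS = [
    "clustal", "fasta", "html", "mega", "nexus",
    "phylip", "phylip40", "phylip32", "phylippaml", "nbrf", "pir",
]
# Formats documented as supporting the `_m10` variant.
M10_CAPABLE = ["fasta", "nexus", "phylippaml", "phylip32", "phylip40"]


def documented_format_names():
    """Return every format name mentioned in the hint, `_m10` variants included."""
    names = set(BASE_FORMATS)
    names.update(f"{name}_m10" for name in M10_CAPABLE)
    return names


def names_over_limit(names, limit=10):
    """Return the sequence names longer than `limit` characters."""
    return [name for name in names if len(name) > limit]


print("fasta_m10" in documented_format_names())          # True
print("clustal_m10" in documented_format_names())        # False
print(names_over_limit([b"Sp8", b"a_very_long_name"]))   # [b'a_very_long_name']
```

Checking names against the limit before writing shows which records would be affected by an _m10 format.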
- dumps()#
Dump the alignment to a string in the provided format.
- Parameters:
- Raises:
ValueError – When format is not a recognized file format.
Added in version 0.2.2.
- from_biopython()#
Create a new Alignment from an iterable of Biopython records.
- Parameters:
alignment (iterable of SeqRecord) – An iterable of Biopython record objects to build the alignment from. Passing a Bio.Align.MultipleSeqAlignment object is also supported.
- Returns:
Alignment – A new alignment object ready for trimming.
Added in version 0.5.0.
- from_pyhmmer()#
Create a new Alignment from a pyhmmer.easel.TextMSA.
- Parameters:
alignment (TextMSA) – A PyHMMER object storing a multiple sequence alignment in text format.
- Returns:
Alignment – A new alignment object ready for trimming.
Added in version 0.5.0.
- load()#
Load a multiple sequence alignment from a file.
- Parameters:
path (str, bytes, os.PathLike or file-like object) – The file from which to read the alignment. If a file-like object is given, it must be open in binary mode and support random access with the seek method. Otherwise, it is treated as a path.
format (str, optional) – The file format the alignment is stored in. Must be given when loading from a file-like object; it is autodetected when reading from a path.
- Returns:
Alignment – The deserialized alignment.
Example
>>> msa = Alignment.load("example.001.AA.clw")
>>> msa.names
[b'Sp8', b'Sp10', b'Sp26', b'Sp6', b'Sp17', b'Sp33']
Changed in version 0.3.0: Add support for reading an alignment from a file-like object.
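Because a file-like object must be binary and seekable, a stream that cannot seek (a pipe or a network response, for instance) has to be buffered first. The sketch below shows one way to do that using only the standard library; as_seekable_binary is a hypothetical helper, not part of pytrimal, and it reads the whole stream into memory.

```python
import io


def as_seekable_binary(stream):
    """Return a binary, seekable file-like object with the same content."""
    if isinstance(stream, io.TextIOBase):
        raise TypeError("expected a file-like object opened in binary mode")
    seekable = getattr(stream, "seekable", None)
    if callable(seekable) and seekable():
        # Already usable as-is: hand it back without copying.
        return stream
    # Buffer the remaining content in memory so that seek() becomes available.
    return io.BytesIO(stream.read())


buffer = as_seekable_binary(io.BytesIO(b">Sp8\nGLQIHMMGII\n"))
print(buffer.seekable())  # True
```

The returned object could then be given to Alignment.load together with an explicit format, since the format is not autodetected for file-like objects.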
- to_biopython()#
Create a new MultipleSeqAlignment from this Alignment.
- Returns:
MultipleSeqAlignment – A multiple sequence alignment object as implemented in Biopython.
- Raises:
ImportError – When the Bio module cannot be imported.
Added in version 0.5.0.
- to_pyhmmer()#
Create a new TextMSA from this Alignment.
- Returns:
TextMSA – A PyHMMER multiple sequence alignment in text mode.
- Raises:
ImportError – When the pyhmmer module cannot be imported.
Added in version 0.5.0.
- residues#
The residues in the alignment.
- Type:
AlignmentResidues
- sequence_type#
The type of sequences in the alignment.
Usually one of protein, dna or rna; if the trimAl auto-detection failed, this will be None. This field is used in some trimmer statistics to correctly identify indetermination symbols (X for protein sequences, N for DNA and RNA sequences).
Example
>>> alignment = Alignment(
...     names=[b"Sp8", b"Sp10", b"Sp26"],
...     sequences=[
...         "-----GLGKVIV-YGIVLGTKSDQFSNWVVWLFPWNGLQIHMMGII",
...         "-------DPAVL-FVIMLGTIT-KFS--SEWFFAWLGLEINMMVII",
...         "AAAAAAAAALLTYLGLFLGTDYENFA--AAAANAWLGLEINMMAQI",
...     ]
... )
>>> alignment.sequence_type
'protein'
Added in version 0.9.0.
- sequences#
The sequences in the alignment.
- Type:
AlignmentSequences
- class pytrimal.TrimmedAlignment(Alignment)#
A multiple sequence alignment that has been trimmed.
Internally, the trimming process produces a mask of sequences and a mask of residues. This class only exposes the filtered sequences and residues.
Example
Create a trimmed alignment using two lists to filter out some residues and sequences:
>>> trimmed = TrimmedAlignment(
...     names=[b"Sp8", b"Sp10", b"Sp26"],
...     sequences=["QFSNWV", "KFS--S", "NFA--A"],
...     sequences_mask=[True, True, False],
...     residues_mask=[True, True, True, False, False, True],
... )
The names and sequences properties will only contain the retained sequences and residues:
>>> list(trimmed.names)
[b'Sp8', b'Sp10']
>>> list(trimmed.sequences)
['QFSV', 'KFSS']
Use the original_alignment method to build the original unfiltered alignment containing all sequences and residues:
>>> ali = trimmed.original_alignment()
>>> list(ali.names)
[b'Sp8', b'Sp10', b'Sp26']
>>> list(ali.sequences)
['QFSNWV', 'KFS--S', 'NFA--A']
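The effect of the two masks can be reproduced in plain Python, which may help when checking a mask by hand. This is only a sketch of the filtering semantics shown in the example, not how pytrimal stores or applies masks internally; apply_masks is a hypothetical helper.

```python
from itertools import compress


def apply_masks(names, sequences, sequences_mask, residues_mask):
    """Keep the rows selected by sequences_mask and the columns selected by residues_mask."""
    kept_names = list(compress(names, sequences_mask))
    kept_sequences = [
        "".join(compress(sequence, residues_mask))
        for sequence in compress(sequences, sequences_mask)
    ]
    return kept_names, kept_sequences


kept_names, kept_sequences = apply_masks(
    names=[b"Sp8", b"Sp10", b"Sp26"],
    sequences=["QFSNWV", "KFS--S", "NFA--A"],
    sequences_mask=[True, True, False],
    residues_mask=[True, True, True, False, False, True],
)
print(kept_names)      # [b'Sp8', b'Sp10']
print(kept_sequences)  # ['QFSV', 'KFSS']
```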
- __init__(names, sequences, sequences_mask=None, residues_mask=None)#
Create a new alignment with the given names, sequences and masks.
- Parameters:
names (Sequence of bytes) – The names of the sequences in the alignment.
sequences (Sequence of str) – The actual sequences in the alignment.
sequences_mask (Sequence of bool) – A mask for which sequences to keep in the trimmed alignment. If given, must be as long as the sequences and names lists.
residues_mask (Sequence of bool) – A mask for which residues to keep in the trimmed alignment. If given, must be as long as every element in the sequences argument.
- copy()#
Create a copy of this trimmed alignment.
- load()#
Load a multiple sequence alignment from a file.
- Parameters:
path (str, bytes, os.PathLike or file-like object) – The file from which to read the alignment. If a file-like object is given, it must be open in binary mode and support random access with the seek method. Otherwise, it is treated as a path.
format (str, optional) – The file format the alignment is stored in. Must be given when loading from a file-like object; it is autodetected when reading from a path.
- Returns:
Alignment – The deserialized alignment.
Example
>>> msa = Alignment.load("example.001.AA.clw")
>>> msa.names
[b'Sp8', b'Sp10', b'Sp26', b'Sp6', b'Sp17', b'Sp33']
Changed in version 0.3.0: Add support for reading an alignment from a file-like object.
- original_alignment()#
Rebuild the original alignment from which this object was obtained.
- Returns:
Alignment – The untrimmed alignment that produced this trimmed alignment.
- terminal_only()#
Get a trimmed alignment where only the terminal residues are removed.
- Returns:
TrimmedAlignment – The alignment where only terminal residues have been trimmed.
- class pytrimal.AlignmentSequences#
A read-only view over the sequences of an alignment.
Objects from this class are created in the sequences property of Alignment objects. Use it to access the string data of individual rows from the alignment:
>>> msa = Alignment.load("example.001.AA.clw")
>>> len(msa.sequences)
6
>>> msa.sequences[0]
'-----GLGKVIV-YGIVLGTKSDQFSNWVVWLFPWNGLQIHMMGII'
>>> sum(seq.count('-') for seq in msa.sequences)
43
A slice over a subset of the sequences can also be obtained without copying the internal data, which makes it possible to create a new Alignment with only some sequences from the original one:
>>> msa2 = Alignment(msa.names[:4:2], msa.sequences[:4:2])
>>> len(msa2.sequences)
2
>>> msa2.sequences[1] == msa.sequences[2]
True
Added in version 0.4.0: Support for zero-copy slicing.
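The slice [:4:2] used in the example follows ordinary Python slicing rules: it selects rows 0 and 2, which is why the second row of the new alignment equals the third row of the original. The sketch below shows this with plain lists, using the names from the example file. Unlike the view described here, slicing a list copies its items.

```python
# Row names of the six-sequence example alignment.
names = [b"Sp8", b"Sp10", b"Sp26", b"Sp6", b"Sp17", b"Sp33"]

# Start at row 0, stop before row 4, take every second row.
selected_indexes = list(range(len(names)))[:4:2]
selected_names = names[:4:2]

print(selected_indexes)  # [0, 2]
print(selected_names)    # [b'Sp8', b'Sp26']
```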
- __getitem__(key, /)#
Return self[key].
- __len__()#
Return len(self).
- class pytrimal.AlignmentResidues#
A read-only view over the residues of an alignment.
Objects from this class are created in the residues property of Alignment objects. Use it to access the string data of individual columns from the alignment:
>>> msa = Alignment.load("example.001.AA.clw")
>>> len(msa.residues)
46
>>> msa.residues[0]
'--A---'
>>> msa.residues[-1]
'IIIIFL'
Added in version 0.4.0: Support for zero-copy slicing.
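Conceptually, the residues view is the transpose of the sequences view: entry i holds the i-th character of every row. A small pure-Python sketch of that relationship follows. It builds new strings and therefore copies data, which the zero-copy view is meant to avoid; columns is a hypothetical helper, not part of pytrimal.

```python
def columns(sequences):
    """Return the columns of equally long sequences as strings."""
    return ["".join(column) for column in zip(*sequences)]


rows = ["QFSNWV", "KFS--S", "NFA--A"]
cols = columns(rows)

print(len(cols))  # 6
print(cols[0])    # QKN
print(cols[-1])   # VSA
```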
- __getitem__(key, /)#
Return self[key].
- __len__()#
Return len(self).