Sequence Alignment

Aligner

The Aligner toolkit offers a versatile dynamic programming algorithm, from which specific aligning implementations are readily produced. The base class (Aligner) contains a dynamic programming algorithm that can be run in global, semi-global, or local modes. The class has two functions,

.add_target(protein)

.add_template(protein)

which take as arguments the Protein objects from a MolecularSystem object (available in the Python list molecular_system.ProteinList). If only sequence information is available,

.add_target_sequence(string)

.add_template_sequence(string)

are used, each of which takes a list of characters or a character string as input. This base class returns a pair of equal-length character lists containing the single-letter residue names and hyphens (-), which represent the alignment's optimally placed gaps. The Aligner class is initialized as follows:

Aligner.__init__(1, "local", [0.8, 0.5])

with a 0 or 1 text-output toggle parameter, an alignment-mode parameter ("local", "semilocal", or "global", representing Smith-Waterman, semi-global, and Needleman-Wunsch type algorithms, respectively), and a list of two gap values (the gap-start and gap-extension penalties).
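As a rough illustration of how the global and local modes differ, the sketch below computes only the best alignment score for two strings. It is not the toolkit's code: the function name best_score, the match/mismatch values, and the single linear gap penalty (used here in place of separate gap-start and gap-extension penalties) are all assumptions made to keep the example short.

```python
def best_score(a, b, mode="global", match=1.0, mismatch=-1.0, gap=1.0):
    """Best alignment score under a linear gap penalty (illustrative only)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0.0] * cols for _ in range(rows)]

    if mode == "global":
        # Global (Needleman-Wunsch style): leading gaps are penalised.
        for i in range(1, rows):
            table[i][0] = -gap * i
        for j in range(1, cols):
            table[0][j] = -gap * j

    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + pair,  # align the two residues
                table[i - 1][j] - gap,       # gap in b
                table[i][j - 1] - gap,       # gap in a
            )
            if mode == "local":
                # Local (Smith-Waterman style): scores never drop below zero,
                # and the best cell anywhere in the table wins.
                cell = max(cell, 0.0)
                best = max(best, cell)
            table[i][j] = cell

    # Global mode reads the score from the final corner of the table.
    return best if mode == "local" else table[-1][-1]


global_score = best_score("ACGT", "AGT", mode="global")
local_score = best_score("TTACGTT", "GGACGCC", mode="local")
```

In the local case only the shared ACG stretch contributes to the score, whereas the global case must account for every residue in both strings.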

Objects are not made by calling the Aligner class directly. Three practical classes have been written that instead inherit and apply this base-class functionality: SequenceAligner, SequenceBasedStructureAligner, and StructureAligner. Class SequenceAligner is described below. StructureAlignment classes are described on another page.

To reuse the dynamic programming algorithm implemented in the Aligner base class, four functions are available for overloading, only one of which is mandatory.

.score() is called by the dynamic programming code (align() in Aligner.py).

The base class score() simply returns a null value, so this function must be overloaded by an inheriting class, as in SequenceAligner.py. The score() function always receives the arguments r and m. These are indices into the .residues feature of the polymer objects. They are used in score() to gather the values compared by the scoring function, for instance amino acid type or shielding.

.__init__() may be overloaded, as in SequenceAligner, to produce a scoring matrix for use in the score() function. There are three arguments. If print_param is set to 1, the alignment is printed; the default is no printing. The alignment_type parameter can be set to "local" (Smith-Waterman), "semi-local", or "global" (Needleman-Wunsch).

A list of two values can be provided as the final argument (gap_scores), for example [2.0, 0.8].

._prepare_polymer() is called in add_target() and add_template() to apply similar code to each polymer as it is registered with the Aligner child class. It is overloaded in StructureAligner to generate its distance lists.

.align() produces no analysis or report. These should be implemented in a child class function that calls the parent align() function and then analyzes the results, as in each of the example child classes.
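The overloading pattern described above can be sketched as follows. The Aligner class here is a hypothetical stand-in written only so the example runs on its own; it is not the toolkit's Aligner.py, and its align() is a simplified global alignment that uses only the first gap value as a linear penalty. IdentityAligner and align_and_report() are likewise invented names for illustration.

```python
class Aligner:
    """Hypothetical stand-in for the toolkit's base class (not the real code)."""

    def __init__(self, print_param=0, alignment_type="global", gap_scores=(2.0, 0.8)):
        self.print_param = print_param
        self.alignment_type = alignment_type
        self.gap_scores = list(gap_scores)
        self.target = None
        self.template = None

    def _prepare_polymer(self, polymer):
        # Hook: a child class may precompute per-polymer data here.
        return polymer

    def add_target_sequence(self, sequence):
        self.target = self._prepare_polymer(list(sequence))

    def add_template_sequence(self, sequence):
        self.template = self._prepare_polymer(list(sequence))

    def score(self, r, m):
        # The base class returns a null value; a child class must overload this.
        return None

    def align(self):
        # Simplified global alignment; assumes a linear gap penalty.
        t, m, gap = self.target, self.template, self.gap_scores[0]
        table = [[0.0] * (len(m) + 1) for _ in range(len(t) + 1)]
        for i in range(1, len(t) + 1):
            table[i][0] = -gap * i
        for j in range(1, len(m) + 1):
            table[0][j] = -gap * j
        for i in range(1, len(t) + 1):
            for j in range(1, len(m) + 1):
                table[i][j] = max(
                    table[i - 1][j - 1] + self.score(i - 1, j - 1),
                    table[i - 1][j] - gap,
                    table[i][j - 1] - gap,
                )
        # Trace back from the final corner, inserting '-' for gaps.
        out_t, out_m = [], []
        i, j = len(t), len(m)
        while i > 0 or j > 0:
            if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + self.score(i - 1, j - 1):
                out_t.append(t[i - 1])
                out_m.append(m[j - 1])
                i, j = i - 1, j - 1
            elif j == 0 or (i > 0 and table[i][j] == table[i - 1][j] - gap):
                out_t.append(t[i - 1])
                out_m.append("-")
                i -= 1
            else:
                out_t.append("-")
                out_m.append(m[j - 1])
                j -= 1
        return out_t[::-1], out_m[::-1]


class IdentityAligner(Aligner):
    """Example child class: overloads score() and wraps align()."""

    def score(self, r, m):
        # r and m index the target and template residues, respectively.
        return 1.0 if self.target[r] == self.template[m] else 0.0

    def align_and_report(self):
        # Call the parent align(), then analyse the result in the child.
        aligned_target, aligned_template = super().align()
        matches = sum(
            1 for a, b in zip(aligned_target, aligned_template) if a == b and a != "-"
        )
        return aligned_target, aligned_template, matches


aligner = IdentityAligner(0, "global", [2.0, 0.8])
aligner.add_target_sequence("ACGT")
aligner.add_template_sequence("AGT")
aligned_target, aligned_template, matches = aligner.align_and_report()
```

Keeping the reporting in a wrapper such as align_and_report() leaves the parent align() free of analysis code, which matches the division of labour described for the child classes.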

SequenceAligner

Class SequenceAligner inherits the dynamic programming algorithm from class Aligner, uses the PET91 scoring matrix to align the residues, and reports the percent identity between the aligned sequences.

Create an instance of the SequenceAligner class, then load it using the add_target() and add_template() functions. There are three different output functions, each of which internally calls the Aligner .align() function:

.align_sequences() -- performs the alignment and returns the percent identity as a value between 0 and 1.

.get_alignment() -- prints the percent identity, then returns two lists of equal size, containing pointers to the AminoAcid objects that comprise the alignment, spaced with '-' characters to represent gaps.

.get_pid_and_length() -- performs the alignment and returns two values: the fraction of identical residues and the length of the alignment, calculated as the number of non-gap residue pairs in the alignment.
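The two values returned by .get_pid_and_length() can be illustrated with a small standalone function that works on a pair of gapped character lists. This is a sketch, not the SequenceAligner code: the function name pid_and_length is hypothetical, and dividing the identical count by the number of non-gap pairs is an assumption about how the fraction is normalised.

```python
def pid_and_length(aligned_target, aligned_template, gap="-"):
    """Return (fraction identical, number of non-gap residue pairs)."""
    # Keep only the columns where neither sequence has a gap.
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != gap and b != gap
    ]
    length = len(pairs)
    if length == 0:
        return 0.0, 0
    identical = sum(1 for a, b in pairs if a == b)
    # Assumption: identity is normalised by the non-gap alignment length.
    return identical / length, length


pid, length = pid_and_length(list("ACDEF"), list("AC-QF"))
```

Here the D/- column is skipped, so four residue pairs remain, three of which are identical.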
