Picking up from where we left in Sequence Alignment and Comparisons, in this class we will be discussing ways to perform fast pattern matching for:
a. long patterns (i.e. not motifs)
b. against longer sequences (i.e. large sequence chunks, scaffolds or even entire chromosomes)
c. for large numbers of patterns (i.e. not one or two but of the orders of hundreds of sequences)
We will look into the main algorithms used for pattern matching such as the Knuth-Morris-Pratt (KMP) and the Boyer Moore Algorithms as general algorithms before moving on to biology-inspired algorithms for fast searches such as FASTA, BLAST and BLAT.
In essence, this class will serve as an introduction for the following in which we will focus on NGS related algorithms for sequence mapping
For an initiation on the problem of Alignment (Pairwise) you can take a look at the undergraduate Bioinformatics lecture (cb_2016_lecture_04_seqcomparison.pdf).
Acknowledgements: Thanks to Ben Langmead for slides related to the BM algorithm.
Exercise #3
For your third exercise you will need to implement the Needleman-Wunsch algorithm for pairwise alignment of genomic sequences. The input should be two DNA sequences and a scoring scheme table of the simplest type (MatchScore, MisMatch Score, GapScore). The output should be the alignment printed on a file alongside its score.
(You can read more about the algorithm and the alignment problem here: https://sites.google.com/site/uoccomputationalbiology/lectures/04-sequence-alignment).