Assignment 1: Sequence Distance and Relatedness

The goal of this assignment is to 1) download two closely related entries from a biological database and 2) describe how the relatedness or distance between these entries can be defined and computed. We decided to focus on sequence similarity and used BLAST to retrieve a closely related sequence.

Our process for finding two closely related sequences

First, we downloaded two 16S ribosomal RNA gene sequences from GenBank at NCBI, taken from the genomes of Escherichia coli str. K-12 substr. MG1655 (NC_000913.3 [4166659...4168200]) and Shigella flexneri 2a str. 301 (NC_004337.2 [3410014...3411555]). We chose the 16S gene because it is ubiquitous among bacteria and relatively small, which makes it easier to visualize here. Both bacteria belong to the Enterobacteriaceae family, so we expected their sequences to be similar.
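
One way such regions could be retrieved programmatically is sketched below. It assumes the NCBI E-utilities efetch endpoint and its seq_start and seq_stop parameters; it is not necessarily how we obtained the sequences. The helper names region_length and efetch_url are illustrative, strand is not handled, and the code only builds the URLs without sending any request. It also checks that both coordinate ranges cover the same number of bases.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession and 1-based inclusive coordinates, as given in the text above.
REGIONS = {
    "E. coli K-12 MG1655": ("NC_000913.3", 4166659, 4168200),
    "S. flexneri 2a 301": ("NC_004337.2", 3410014, 3411555),
}


def region_length(start, stop):
    """Length of a 1-based, inclusive coordinate range."""
    return stop - start + 1


def efetch_url(accession, start, stop):
    """Build an efetch URL that would return the region as FASTA text."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
    }
    return f"{EFETCH}?{urlencode(params)}"


for name, (accession, start, stop) in REGIONS.items():
    print(name, region_length(start, stop), efetch_url(accession, start, stop))
```

Both ranges come out at 1542 bases, which matches the sequence length used in the similarity calculation at the end of this report.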

We then aligned the two sequences with BLAST on the NCBI website; the method is described in greater detail below.

Below is an image showing the alignment and BLAST result between these two sequences.

BLAST also reported several measures of the similarity or distance between the entries, which are described in greater detail below. Max score: 2509, Total score: 2509, Query cover: 100%, E-value: 0.0, Identity: 99%.

What are the measures of similarity and distance for these sequences, and how did we compute them?

BLAST is an algorithm that allows rapid sequence comparison across database entries, retrieving sequences that are similar to a query sequence. It is a heuristic: it takes short strings ("words") of nucleotides from the query sequence and finds matches to them in the database entries. It then attempts an alignment, extending outward from these short matches. Unlike the Smith-Waterman algorithm, BLAST does not guarantee an optimal alignment, but it finds closely related database entries very quickly.
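
The seed-and-extend idea can be sketched in a few lines. This is a toy illustration, not BLAST's actual implementation: the function names find_seeds and extend_seed are our own, the word size of 11 is an assumed default, and the extension step simply stops at the first mismatch, whereas BLAST scores the extension and tolerates some mismatches.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Index every word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)
    # Look up every word of the query in that index.
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size):
    """Extend a seed without gaps in both directions while bases match.

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while (i - left - 1 >= 0 and j - left - 1 >= 0
           and query[i - left - 1] == subject[j - left - 1]):
        left += 1
    right = word_size
    while (i + right < len(query) and j + right < len(subject)
           and query[i + right] == subject[j + right]):
        right += 1
    return i - left, j - left, left + right


query = "ACGTACGTTAGC"
subject = "TTACGTACGTTAGG"
seeds = find_seeds(query, subject, word_size=4)
print(seeds[0], extend_seed(query, subject, *seeds[0], 4))
```

Indexing the words first is what makes the search fast: each query word is looked up directly instead of being compared against every position of every database entry.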

BLAST returns several measures of similarity and distance for sequences, listed below:

  • Percent Identity: The percentage of alignment positions at which the two sequences share the same amino acid or nucleotide.
  • E-value: The expected number of alignments with an equivalent or better score that would occur in the database search by chance.
  • p-value: The probability that an alignment with a random sequence would produce the same score or better.
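
As a rough illustration of these measures, the sketch below computes percent identity from two aligned strings and converts an E-value to a p-value using the standard relation P = 1 - e^(-E). The function names are our own, gaps are assumed to be written as "-", and gap columns are counted as non-identical positions in the denominator, which is one convention among several.

```python
import math


def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


def p_value_from_e_value(e_value):
    """P = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e_value)


print(percent_identity("ACGT-A", "ACCTTA"))
print(p_value_from_e_value(1.0))
print(p_value_from_e_value(0.0))
```

For very small E-values the two quantities are nearly equal, so an E-value reported as 0.0 corresponds to a p-value that is also effectively 0.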

References:

https://www.ncbi.nlm.nih.gov/books/NBK62051/

http://resources.qiagenbioinformatics.com/manuals/clcmainworkbench/current/index.php?manual=How_does_BLAST_work.html

http://bitesizebio.com/26522/blast-off-the-basic-local-alignment-search-tool-explained/


How did we define similarity and distance?


For this case, we define distance as the edit distance between the two sequences, with a penalty of 1 for each mismatch and each gap. The distance is therefore the total number of edits required to turn one sequence into the other.

Distance = (# of Mismatches × Mismatch Penalty) + (# of Gaps × Gap Penalty)

For the two sequences we chose, there were 7 mismatches and 0 gaps, so we say the two sequences have a distance of 7.
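
The count can be sketched as follows, assuming the two sequences are given as equal-length aligned strings with "-" marking a gap and each gapped column counted as one gap. The function name alignment_distance is illustrative, and the result is the number of edits along the given alignment.

```python
def alignment_distance(aligned_a, aligned_b, mismatch_penalty=1, gap_penalty=1):
    """Penalty-weighted count of mismatches and gaps along an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    mismatches = 0
    gaps = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            gaps += 1          # a column with a gap in either sequence
        elif a != b:
            mismatches += 1    # two different bases in the same column
    return mismatches * mismatch_penalty + gaps * gap_penalty


# Toy example: one mismatch (G/C) and one gap.
print(alignment_distance("ACGT-A", "ACCTTA"))
```

With 7 mismatches, 0 gaps, and both penalties set to 1, the function returns 7 × 1 + 0 × 1 = 7.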


We then define similarity in terms of this edit distance, reflecting how few changes it would take to make one sequence an exact copy of the other. Our semi-global similarity score is one minus the edit distance divided by the length of the shorter sequence.

Similarity = 1 - (Distance / Length of Shorter Sequence)

We calculated the edit distance to be 7, and both sequences here are 1542 base pairs long. Using the defined formula, the similarity score for the two sequences is 1 - (7 / 1542) ≈ 0.9955.
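
The arithmetic can be checked with a short sketch of the formula. The function name similarity is our own.

```python
def similarity(distance, length_a, length_b):
    """One minus the distance divided by the length of the shorter sequence."""
    return 1 - distance / min(length_a, length_b)


score = similarity(7, 1542, 1542)
print(round(score, 4))  # 0.9955
```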