2. Finding Putative Homologs

For this exercise, we will be using E. coli sRNA sequences taken from Gisela Storz's spreadsheet. Feel free to use your own sequences instead! Most of what is discussed here will extend to any organism(s) - though obviously the relevant taxonomic distinctions will be different!

Attached are a number of sRNA sequences that appear to be widely distributed within the Enterobacteriaceae (and beyond!). We will start with MicA for this tutorial; you can then move on to another.

Our first task in building an alignment is to identify a collection of close homologs we can use to begin constructing our profile. We'll use NCBI BLAST for this purpose. On the search form, set the database to Nucleotide collection (nr/nt) and limit the organism to the Enterobacteriaceae. Set max target sequences to 1000 and the expect threshold to 100 so that more distant hits are reported. Setting the word size to 7 will improve sensitivity. Once you've done this, search with the MicA sequence. The NCBI results provide a number of valuable statistics, including the percent query coverage, the E-value (the number of hits with a score at least this good that would be expected by chance), and the percent identity. The alignments also give sequence context for hits.
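If you prefer to script the search rather than use the web form, the same settings can be passed through Biopython. This is only a sketch under assumptions: it assumes Biopython is installed and provides `NCBIWWW.qblast`, and the names `BLAST_SETTINGS` and `search_homologs` are made up for illustration. The `Enterobacteriaceae[Organism]` Entrez filter is our guess at the equivalent of the organism box on the form.

```python
# Settings mirroring the web form described above (assumed mapping to
# qblast keyword arguments; check them against your Biopython version).
BLAST_SETTINGS = {
    "entrez_query": "Enterobacteriaceae[Organism]",  # limit to the family
    "hitlist_size": 1000,  # "max target sequences"
    "expect": 100,         # "expect threshold"
    "word_size": 7,        # smaller words improve sensitivity
}


def search_homologs(sequence):
    """Submit a blastn search against nt and return a handle to XML results.

    Hypothetical helper: it contacts NCBI over the network, so it is
    defined here but not called.
    """
    from Bio.Blast import NCBIWWW  # imported lazily; requires Biopython

    return NCBIWWW.qblast("blastn", "nt", sequence, **BLAST_SETTINGS)
```

Calling `search_homologs` with your query sequence should return a handle whose contents can be saved to a file and inspected later, which avoids repeating a slow remote search.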

You may wish to visualize your hits to get a better idea of their context. The ENA provides an online genome browser which will allow you to do this quickly and easily. As an example for MicA, try looking at the Yersinia enterocolitica 8081 genome. The sequence coordinates of its MicA homolog should be 966434-966505. ENA also allows you to retrieve the sequence for your hits. Another good option for working with bacterial sequences is the UCSC Microbial Genome Browser, provided your genomes of interest are available there. The UCSC browser contains many useful tracks, such as operon predictions and conservation information.

Now take this chance to select some representative hits to build your initial alignment from. The following criteria seem to work well for most organisms:

Try to select from a variety of organisms. What exactly this means will depend on what organisms you're working on. For now, in bacteria, we'll limit ourselves to one or two sequences per species, but depending on divergence you may want to limit by strain or genus instead.
Try to select sequences that conserve synteny - this provides an independent line of evidence that the RNA in question is a direct homolog.
Take sequences between 70% and 90% identity. This will hopefully capture some sequences with enough diversity to contribute to secondary structure prediction, while still being alignable by conventional (non-Sankoff) methods.
Take sequences with 100% coverage, unless you're unsure of the boundaries of your molecule.
Take sequences with an E-value of 0.01 or less. This should help to limit sequences to high-confidence matches.

Take note of sequences, particularly whole genomes, which just miss these criteria on coverage, percent identity, or E-value; these are good candidates for further searches.
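The numeric criteria above are easy to apply by script once your hits are in a table. Here is a minimal sketch, not a definitive implementation: the `Hit` record, its field names, and every value in the example are invented for illustration, and the thresholds simply mirror the list above. Synteny cannot be judged from these statistics and still has to be checked by eye.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    # Hypothetical record for one BLAST hit; the field names are our own.
    accession: str
    species: str
    identity: float  # percent identity, 0-100
    coverage: float  # percent query coverage, 0-100
    evalue: float


def select_representatives(hits, min_id=70.0, max_id=90.0,
                           min_cov=100.0, max_evalue=0.01, per_species=1):
    """Split hits into those passing the criteria and those set aside.

    At most `per_species` passing hits are kept per species, best
    E-value first. Hits failing any threshold are returned separately
    so they can be reviewed as candidates for further searches.
    """
    chosen, set_aside, counts = [], [], {}
    for hit in sorted(hits, key=lambda h: h.evalue):
        passes = (min_id <= hit.identity <= max_id
                  and hit.coverage >= min_cov
                  and hit.evalue <= max_evalue)
        if not passes:
            set_aside.append(hit)
        elif counts.get(hit.species, 0) < per_species:
            chosen.append(hit)
            counts[hit.species] = counts.get(hit.species, 0) + 1
    return chosen, set_aside


# Made-up example values, for illustration only.
example_hits = [
    Hit("hypA", "Species A", 85.0, 100.0, 1e-20),
    Hit("hypB", "Species A", 88.0, 100.0, 1e-18),  # second hit, same species
    Hit("hypC", "Species B", 95.0, 100.0, 1e-30),  # too similar
    Hit("hypD", "Species C", 75.0, 90.0, 1e-5),    # incomplete coverage
    Hit("hypE", "Species D", 72.0, 100.0, 0.005),
]
chosen, set_aside = select_representatives(example_hits)
print([h.accession for h in chosen])
print([h.accession for h in set_aside])
```

Raising `per_species` to 2 matches the "one or two sequences per species" guideline; the set-aside list is where to look for hits that only just miss a threshold.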

You will want to put these sequences in FASTA format for alignment. See below for an example collection. Run these sequences individually through RNAfold to see how the single sequence predictions differ.

>U00096.2

GAAAGACGCGCATTTGTTATCATCATCCCTGAATTCAGAGATGAAATTTTGGCCACTCACGAGTGGCCTTTTT

>FQ312003

GAAAGACGCGCATTTGTTATCATCATCCCTGTTTTCAGCGATGAAATTTTGGCCACTCCGTGAGTGGCCTTTTT

>CP002272

GAAAGACGCGCATTTGTTATCATCATCCCTGACTTCAGAGATGAAATGTTTGGCCACAGTGATGTGGCCTTTTT

>CP002910

GAAAGACGCGCATTTATTATCATCATCATCCCTGAATCAGAGATGAAAGTTTGGCCACAGTGATGTGGCCTTTTT

>AM286415

GAAAGACGCGCATTTGTTATCATCATCCCTGTTATCAGAGATGTTAATTTGGCCACAGCAATGTGGCCTTTT

>CP002433

GAAAGACGCGCATTTGTTATCATCATCCCTGACAACAGAGATGTTAATTCGGCCACAGTGATGTGGCCTTTT

>FP236842

GAAAGACGCGTATTTGTTATCATCATCTCATCCCTGACAACAGAGATGTTAATTTAGGCCACAGTGACGTGGCCTTTTT
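To fold each sequence on its own, a collection like the one above can be split into records and handed to RNAfold one at a time. The sketch below assumes the ViennaRNA command-line program `RNAfold` is on your PATH and accepts FASTA on standard input with the `--noPS` option (which suppresses the PostScript drawing); `read_fasta` and `fold_each` are illustrative names of our own. Only the first two sequences from the collection are used, to keep the example short.

```python
import shutil
import subprocess


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, ignoring blank lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def fold_each(records):
    """Run RNAfold separately on every sequence.

    Returns {name: raw RNAfold output}, or None if RNAfold is not
    installed. The output is not parsed here.
    """
    if shutil.which("RNAfold") is None:
        return None
    results = {}
    for name, seq in records.items():
        proc = subprocess.run(
            ["RNAfold", "--noPS"],
            input=f">{name}\n{seq}\n",
            capture_output=True,
            text=True,
        )
        results[name] = proc.stdout
    return results


fasta_text = """
>U00096.2

GAAAGACGCGCATTTGTTATCATCATCCCTGAATTCAGAGATGAAATTTTGGCCACTCACGAGTGGCCTTTTT

>FQ312003

GAAAGACGCGCATTTGTTATCATCATCCCTGTTTTCAGCGATGAAATTTTGGCCACTCCGTGAGTGGCCTTTTT
"""

records = read_fasta(fasta_text)
folds = fold_each(records)
if folds is None:
    print("RNAfold not found; parsed", len(records), "sequences only")
else:
    for name, output in folds.items():
        print(output)
```

Comparing the predicted structures side by side shows how much the single-sequence predictions differ between homologs.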
