For this exercise, we will be using E. coli sRNA sequences taken from Gisela Storz's spreadsheet. Feel free to use your own sequences instead! Most of what is discussed here will extend to any organism(s) - though obviously the relevant taxonomic distinctions will be different!
Attached are a number of sRNA sequences which appear to be widely distributed within the Enterobacteriaceae (and beyond!). We will start with MicA for this tutorial; you can then move on to another.
Our first task in building an alignment is to identify a collection of close homologs we can use to begin constructing our profile. We'll use NCBI-BLAST for this purpose. Make sure to change the settings on the search form so that we're searching the Nucleotide collection (nr/nt), limited to the Enterobacteriaceae. Set max target sequences to 1000 and the expect threshold to 100 to see more distant hits. Setting the word size to 7 will improve sensitivity, since shorter seed matches are enough to start an alignment. Once you've done this, search with the MicA sequence. The NCBI results provide a number of valuable statistics, including the percent query coverage, the E-value (the number of hits with this score or better expected by chance), and the percent identity. The alignments also give sequence context for hits.
You may wish to visualize your hits to get a better idea of their genomic context. The ENA provides an online genome browser which will allow you to do this quickly and easily. As an example for MicA, try looking at the Yersinia enterocolitica 8081 genome. The sequence coordinates of its MicA homolog should be 966434-966505. ENA also allows you to retrieve the sequence for your hits. Another good option for working with bacterial sequences is the UCSC Microbial Genome Browser, if your genomes of interest are available there. The UCSC browser contains many useful tracks, such as operon predictions and conservation information.
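If you have already downloaded a genome sequence, the same region can be cut out locally. The sketch below assumes the coordinates are 1-based and inclusive, which is the usual convention for database coordinates but worth checking for each browser. The function names (extract_region, region_length) are made up for this illustration.

```python
# Complement table for DNA; case is preserved.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def region_length(start, end):
    """Length of a region given 1-based inclusive coordinates."""
    return end - start + 1


def extract_region(genome, start, end, strand="+"):
    """Return the subsequence start..end (1-based, inclusive) of genome.

    For strand "-", the reverse complement of that span is returned.
    """
    if not (1 <= start <= end <= len(genome)):
        raise ValueError("coordinates fall outside the sequence")
    # Convert to Python's 0-based, end-exclusive slicing.
    region = genome[start - 1:end]
    if strand == "-":
        region = region.translate(COMPLEMENT)[::-1]
    return region


# The span quoted above for the Y. enterocolitica 8081 MicA homolog:
print(region_length(966434, 966505))  # 72
```

Calling extract_region(genome, 966434, 966505) on the full genome string would then return the 72 nt of the hit. Pass strand="-" if BLAST reports the hit on the minus strand.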
Now take this chance to select some representative hits to build your initial alignment from. The following criteria seem to work well for most organisms:
Take note of sequences, particularly whole genomes, which just miss these criteria on coverage, percent identity, or E-value; these are good candidates for further searches.
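If you export your hits as a tab-separated table, this sorting can be scripted. The sketch below assumes the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The names Hit, parse_hits and triage are invented here, and the thresholds are left as parameters for you to fill in with your own criteria. Note that coverage is computed per aligned segment, which can differ from the query coverage shown on the NCBI results page when a subject has several aligned segments.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "subject pident qstart qend evalue coverage")


def parse_hits(tabular_text, query_length):
    """Parse 12-column tabular BLAST output into Hit records."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        qstart, qend = int(f[6]), int(f[7])
        # Percent of the query covered by this aligned segment.
        coverage = 100.0 * (abs(qend - qstart) + 1) / query_length
        hits.append(Hit(f[1], float(f[2]), qstart, qend, float(f[10]), coverage))
    return hits


def passes(hit, min_cov, min_pident, max_evalue):
    """True if the hit meets all three thresholds."""
    return (hit.coverage >= min_cov
            and hit.pident >= min_pident
            and hit.evalue <= max_evalue)


def triage(hits, strict, relaxed):
    """Split hits into (selected, near_misses).

    strict and relaxed are dicts with keys min_cov, min_pident, max_evalue.
    A near miss fails the strict thresholds but passes the relaxed ones.
    """
    selected, near_misses = [], []
    for hit in hits:
        if passes(hit, **strict):
            selected.append(hit)
        elif passes(hit, **relaxed):
            near_misses.append(hit)
    return selected, near_misses
```

The selected list gives candidates for the initial alignment, and near_misses is the list of sequences worth revisiting in later searches.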
You will want to put these sequences in FASTA format for alignment. See below for an example collection. Run these sequences individually through RNAfold to see how the single-sequence structure predictions differ.
>U00096.2
GAAAGACGCGCATTTGTTATCATCATCCCTGAATTCAGAGATGAAATTTTGGCCACTCACGAGTGGCCTTTTT
>FQ312003
GAAAGACGCGCATTTGTTATCATCATCCCTGTTTTCAGCGATGAAATTTTGGCCACTCCGTGAGTGGCCTTTTT
>CP002272
GAAAGACGCGCATTTGTTATCATCATCCCTGACTTCAGAGATGAAATGTTTGGCCACAGTGATGTGGCCTTTTT
>CP002910
GAAAGACGCGCATTTATTATCATCATCATCCCTGAATCAGAGATGAAAGTTTGGCCACAGTGATGTGGCCTTTTT
>AM286415
GAAAGACGCGCATTTGTTATCATCATCCCTGTTATCAGAGATGTTAATTTGGCCACAGCAATGTGGCCTTTT
>CP002433
GAAAGACGCGCATTTGTTATCATCATCCCTGACAACAGAGATGTTAATTCGGCCACAGTGATGTGGCCTTTT
>FP236842
GAAAGACGCGTATTTGTTATCATCATCTCATCCCTGACAACAGAGATGTTAATTTAGGCCACAGTGACGTGGCCTTTTT
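If you would rather fold the sequences locally than through a web form, something like the following should work. It is a minimal sketch that assumes the ViennaRNA command-line program RNAfold is installed and on your PATH. The helper names parse_fasta and fold_each are made up for this example, and only the first two records from the collection above are included to keep it short.

```python
import shutil
import subprocess


def parse_fasta(text):
    """Parse FASTA text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


def fold_each(records):
    """Run RNAfold once per sequence; return {header: raw RNAfold output}."""
    if shutil.which("RNAfold") is None:
        raise RuntimeError("RNAfold not found on PATH")
    results = {}
    for name, seq in records.items():
        proc = subprocess.run(
            ["RNAfold", "--noPS"],  # --noPS: do not write PostScript drawings
            input=f">{name}\n{seq}\n",
            capture_output=True, text=True, check=True,
        )
        results[name] = proc.stdout
    return results


EXAMPLE = """>U00096.2
GAAAGACGCGCATTTGTTATCATCATCCCTGAATTCAGAGATGAAATTTTGGCCACTCACGAGTGGCCTTTTT
>FQ312003
GAAAGACGCGCATTTGTTATCATCATCCCTGTTTTCAGCGATGAAATTTTGGCCACTCCGTGAGTGGCCTTTTT
"""

records = parse_fasta(EXAMPLE)
for name, seq in records.items():
    print(name, len(seq))
```

Calling fold_each(records) then returns the predicted structure and minimum free energy for each sequence as RNAfold prints them, so you can compare the predictions side by side.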