4. Building and Running an Infernal CM

Building a CM

Building a CM from an alignment is easy - just run:

~/infernal-1.0.2/src/cmbuild your.cm your.sto

You can use your own alignment, or the example below. This will give some statistics on your CM, in particular observe the "relative entropy":

# rel entropy

# ------------

# aln cm idx name nseq eff_nseq alen clen bps bifs CM HMM

# ---- ------ -------- -------- -------- ------ ----- ---- ---- ----- -----

1 1 edited-1 7 1.39 83 73 14 1 0.696 0.510

By comparing the numbers for "CM" and "HMM" you can get an idea of how much more information the inclusion of secondary structure gives you over sequence variation alone. If these numbers are similar, this may be diagnostic of a bad alignment or secondary structure prediction.

Now to do searches with accurate E-values, you'll need to calibrate your CM. Run:

~/infernal-1.0.2/src/cmcalibrate your.cm

This will take a while to run, cmcalibrate will give you fairly accurate estimates of run time so you'll know if you have enough time to get coffee. Note that this can take a very long time (an hour for a relatively short model), so you may have to stick with uncalibrated CMs for this tutorial.

A calibrated CM for the example alignment is attached below for instant gratification.

Searching a CM

Now to search your CM, you'll need sequences in FASTA format. The EMBL Genomes pages are a great source of sequences that would be appropriate for inclusion in an RNA family.

When we were searching our single sequence using BLAST, Xenorhabdus nematophila had hits that didn't meet our criteria for inclusion. Download its genome in FASTA format from here. To run a search, type:

~/infernal-1.0.2/src/cmsearch --tblout edited.tab edited.cm FN667742.fasta

This will run your CM (edited.cm) over the Xenorhabdus sequence (FN667742.fasta) and print the results in edited.tab. Again, Infernal will give you an estimate of run time, in this case probably around 5 minutes. For larger (e.g. eukaryotic) genomes, this can be much longer.

If your search was successful, edited.tab should look something like this:

# CM: edited-1

# target coord query coord

# ---------------------- ------------

# model name target name start stop start stop bit sc E-value GC%

# ---------- ----------------------- ---------- ---------- ----- ----- -------- -------- ---

edited-1 ENA|FN667742|FN667742.1 1454810 1454867 1 73 19.76 2.19e-01 34

edited-1 ENA|FN667742|FN667742.1 1293552 1293599 1 73 18.90 3.59e-01 35

edited-1 ENA|FN667742|FN667742.1 4390466 4390532 1 73 18.64 4.17e-01 39

edited-1 ENA|FN667742|FN667742.1 1043471 1043537 1 73 18.57 4.35e-01 40

edited-1 ENA|FN667742|FN667742.1 2791866 2791938 1 73 17.83 6.63e-01 49

edited-1 ENA|FN667742|FN667742.1 1482627 1482697 1 73 17.69 7.20e-01 39

edited-1 ENA|FN667742|FN667742.1 3151264 3151323 1 73 17.44 8.28e-01 33

edited-1 ENA|FN667742|FN667742.1 3744986 3745059 1 73 17.42 8.41e-01 42

edited-1 ENA|FN667742|FN667742.1 1133199 1133128 1 73 25.10 1.02e-02 40

edited-1 ENA|FN667742|FN667742.1 1759717 1759650 1 73 23.62 2.39e-02 41

edited-1 ENA|FN667742|FN667742.1 1293632 1293572 1 73 22.40 4.81e-02 31

edited-1 ENA|FN667742|FN667742.1 318440 318388 1 73 22.07 5.83e-02 36

Note that with the default settings, Infernal only returns hits with an E-value less than 1. There are a few that seem to score well. Check out their sequence context in ENA and see if any look likely to be real homologs!

Retrieving CM hits

Now that you have a working CM and some a putative homolog, you may want to expand your alignment so it captures this additional diversity. To retrieve the highest scoring hit, run:

~/infernal-1.0.2/easel/miniapps/esl-sfetch --index FN667742.fasta

~/infernal-1.0.2/easel/miniapps/esl-sfetch -c 1133199..1133128 FN667742.fasta "ENA|FN667742|FN667742.1" > xeno_hit.fa

This will dump the highest scoring hit (coordinates 1133199 - 1133128) to xeno_hit.fa. Now, there are a few options you can take, which we will discuss in the next section.