With the contigs generated by the read assembly, you can build a skeleton of the genome (scaffold) using a reference genome that is finished or closed.
This can be done with two applications.
The first, Scaffold Builder, produces assemblies in which the gaps are filled with Ns. It can be used through a web page or from the command line. It requires the assembled contigs in FASTA format and the reference strain, also in FASTA format. If the reference strain is in GenBank format, it must first be converted with GB2fasta:
$ GB2fasta.pl reference.gbk reference.fasta
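To make the conversion step concrete, here is a minimal sketch of what going from GenBank to FASTA involves. It is not the GB2fasta.pl implementation; the function name `genbank_to_fasta` and the 70-column line width are assumptions chosen for illustration. It takes the record name from the LOCUS line and the sequence from the ORIGIN block, and discards all annotations.

```python
def genbank_to_fasta(genbank_text, width=70):
    """Convert GenBank flat-file text to FASTA text (sequence only).

    Hypothetical helper for illustration; not part of GB2fasta.pl.
    """
    records = []
    name, in_origin, chunks = None, False, []
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            # The record name is the second whitespace-separated field.
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            # Sequence lines follow the ORIGIN keyword.
            in_origin = True
        elif line.startswith("//"):
            # End of record: emit the header and the wrapped sequence.
            seq = "".join(chunks).upper()
            body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
            records.append(f">{name}\n{body}\n")
            name, in_origin, chunks = None, False, []
        elif in_origin:
            # Lines look like "        1 gatcctccat atacaacggt"; keep letters only.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(records)
```

To use it on files, read `reference.gbk` into a string, pass it to the function, and write the result to `reference.fasta`.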
Scaffold Builder is a Python program:
$ python /usr/bin/scaffold_builder.py -q query_contigs.fna -r reference.fna -p sb
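To illustrate what a scaffold with N-filled gaps looks like, here is a minimal sketch. It is not Scaffold Builder's code. It assumes the contigs have already been ordered and oriented against the reference and that a gap length is already known for each pair of neighbours; the function name `join_with_gaps` is hypothetical.

```python
def join_with_gaps(contigs, gap_lengths):
    """Concatenate ordered contigs, placing a run of Ns between neighbours.

    Hypothetical helper: contigs is a list of sequence strings in scaffold
    order, gap_lengths holds one integer per pair of adjacent contigs.
    """
    if not contigs:
        return ""
    if len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length per pair of adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between the two contigs
        parts.append(contig)
    return "".join(parts)
```

For example, `join_with_gaps(["ACGT", "GGCC"], [3])` yields `ACGTNNNGGCC`: the two contigs keep their sequence and the unresolved stretch between them is marked with Ns.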
Medusa has a web page where the files can be uploaded, or it can be run from the terminal (see the requirements below). Medusa builds the scaffold from one or more complete genomes, which must be placed in a folder in FASTA format.
$ java -jar /opt/medusa/medusa.jar -f /reference_genomes/ -i contigs.fna -v -o scaffold.fna
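Before launching the command above, it can help to confirm that the reference folder really contains FASTA files. The following is a minimal sketch of such a check. The helpers `looks_like_fasta` and `medusa_command` are hypothetical, the jar path and flags are copied from the command above, and testing for a leading `>` is a heuristic assumption, not something Medusa itself is known to do.

```python
from pathlib import Path


def looks_like_fasta(path):
    """Return True if the first non-blank line of the file starts with '>'."""
    with open(path, errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False


def medusa_command(contigs, reference_dir, output, jar="/opt/medusa/medusa.jar"):
    """Build the Medusa argument list after checking the reference folder.

    Hypothetical helper; it only assembles the command and does not run it.
    """
    refs = sorted(p for p in Path(reference_dir).iterdir() if p.is_file())
    if not refs:
        raise ValueError(f"no reference genomes found in {reference_dir}")
    bad = [p.name for p in refs if not looks_like_fasta(p)]
    if bad:
        raise ValueError(f"not FASTA: {', '.join(bad)}")
    return ["java", "-jar", jar,
            "-f", str(reference_dir),
            "-i", str(contigs),
            "-v",
            "-o", str(output)]
```

The returned list can be passed to `subprocess.run(..., check=True)` to start Medusa once the check has passed.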
The following inputs are required:
The following output files will be produced:
The project folder must contain:
- the *targetGenome* in fasta format.
- the medusa.jar file
- the scripts sub-folder “medusa_scripts”.
- the comparison genomes sub-folder “drafts”. (Alternatively, you can
specify another path for this folder using the "-f" option)
Medusa can be run with the following parameters:
1. The option *-i* is required and indicates the name of the target
genome file.
2. The option *-o* is optional and indicates the name of the output fasta
file.
3. The option *-v* (recommended) prints to the console the information given
by the MUMmer package. This option is strongly suggested for
checking whether MUMmer is running properly.
4. The option *-f* is optional and indicates the path to the comparison
drafts folder.
5. The option *-random* is available (not required). This option allows
the user to run a given number of cleaning rounds and keep the best
solution. Since the variability is small, 5 rounds are
usually sufficient to find the best score.
6. The option *-w2* is optional and allows for a sequence similarity
based weighting scheme. Using a different weighting scheme may lead
to better results.
7. The option *-d* allows for the estimation of the distance between pairs of contigs based on the reference genome(s):
in this case the scaffolded contigs will be separated by a number of N characters equal to this estimate.
The estimated distances are also saved in the "*_distanceTable" file.
By default the scaffolded contigs are separated by 100 Ns.
8. The option *-gexf* is optional. With this option, the contig network and
the path cover are provided in gexf format.
9. The option *-n50* allows the calculation of the N50 statistic on a FASTA file.
In this case the usage is the following: java -jar medusa.jar -n50 <name_of_the_fasta>
All the other options will be ignored.
10. Finally the *-h* option provides a small recap of the previous ones.
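
For readers unfamiliar with the N50 statistic mentioned in option 9, here is a minimal sketch of the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total bases. The function names are hypothetical, and whether Medusa's *-n50* option handles details such as N characters in the same way is not stated here, so results may differ.

```python
def read_fasta_lengths(fasta_text):
    """Return the length of every sequence in a FASTA-formatted string."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the sum to half the total.
        if running * 2 >= total:
            return length
    return 0
```

With lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest sequences sum to 11, which is the first point where at least half of the bases are covered, so the N50 is 5.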