Post date: Nov 03, 2014 5:4:56 PM
I finished assembling the new Lycaeides genomes. I ran the assemblies with allpaths-lg and platanus (a new assembler that is supposed to work well for samples with high heterozygosity), but allpaths-lg consistently outperformed platanus on our data, so I will stick with allpaths-lg for now. There are interesting differences in assembly quality and possibly in genome size. This plot shows the cumulative size of the genome contained in the N largest scaffolds (scaffolds sorted from largest to smallest). It is the same kind of plot that Alex sent out for the first L. melissa genome. A steep initial slope on these plots means that much of the assembly is in large scaffolds, and the asymptote shows the total genome size in our assembly.
The genomes for the Sierra Nevada and Warner samples look really good (the Sierra Nevada sample was the one I sent an e-mail about earlier), with most of the genome contained in a modest number of scaffolds (at least relative to our original L. melissa genome, which is labeled 'L. melissa (old)' on the plot). On the other hand, the new L. melissa sample and the L. anna sample look more or less like the old L. melissa sample. I don't know whether these differences simply reflect differences in data quality or differences in the underlying genome composition of these entities (e.g., more or fewer repeats), but the fact that the two hybrid lineages look similar to each other and the two L. melissa samples look similar to each other is suggestive of biological differences (but keep the small sample size in mind).
Perhaps the most interesting thing that we see, however, is that while the amount of assembled genome for L. anna, L. melissa, Sierra Nevada, and Warner is about the same (~350 Mb) and constitutes about 80% of the expected Lycaeides genome size (~440 Mb if I recall correctly), the assembled L. idas genome comes in at 500 Mb. This is bigger than we expected, particularly if we assume that we are still only assembling about 80% of the genome (which would imply a total size of roughly 625 Mb).
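For reference, the cumulative scaffold-size curve described above can be computed directly from an assembly FASTA. This is only a minimal sketch, not the code used to make the plot: the function name cumulative_scaffold_sizes is made up for illustration, and it assumes a standard FASTA file with one header line per scaffold.

```shell
# Print "rank<TAB>cumulative_bp" for scaffolds sorted from largest to smallest.
# cumulative_scaffold_sizes is a hypothetical helper, not part of any assembler.
cumulative_scaffold_sizes() {
    awk '
        /^>/ { if (seen) print len; len = 0; seen = 1; next }  # header: emit previous scaffold length
        { gsub(/[ \t\r]/, ""); len += length($0) }             # sequence line: add its length
        END { if (seen) print len }                            # emit the last scaffold
    ' "$1" |
    sort -rn |                                                 # largest scaffolds first
    awk '{ total += $1; printf "%d\t%d\n", NR, total }'        # running total by rank
}

# Example call (path relative to the whole_genomes directory):
#   cumulative_scaffold_sizes Lsierra/DATA/RUN/ASSEMBLIES/assem15sept14/final.assembly.fasta > Lsierra.cumsize.txt
```

The last line of the output gives the number of scaffolds and the total assembled size (the asymptote on the plot), and the two columns can be plotted as-is to compare assemblies.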
I have begun investigating different approaches to compare these genome assemblies both in terms of content (do they have the same stuff) and synteny (are the same things in the same places). I have played around with two programs for this: MUMmer 3 and SatsumaSynteny. MUMmer was pretty quick, but the initial results were noisy (perhaps because I didn't restrict the matches to true MUMs, i.e., maximal unique matches). SatsumaSynteny is slower, and I can't seem to generate meaningful visualizations from its output.
I am trying MUMmer 3.0 again with potentially more meaningful options (paper, documentation). There are three steps to the alignment. First, exact matches of at least a minimum length are identified between the two sequences (-l 15, i.e., 15 bp). I am using only matches that are unique in both genomes (--mum), but this doesn't have to be the case. Second, matches are grouped into clusters if they are within 1000 bp of each other (-g 1000), and only clusters of 100 bp or more are retained (-c 100). Third, nucmer extends alignments outward from the retained clusters. I am running MUMmer with the Warner and Sierra Nevada assemblies first:
#!/bin/sh
#PBS -N genome
#PBS -l nodes=1:ppn=1
#PBS -l walltime=96:00:00
#PBS -l mem=96g
#PBS -q batch
. /rc/tools/utils/dkinit
cd /home/A01963476/data/lycaeides/whole_genomes/
# align the Warner assembly (query) to the Sierra Nevada assembly (reference)
/home/A01963476/Source/MUMmer3.23/nucmer --mum -c 100 -l 15 -g 1000 \
    --prefix=mumLsierra_Lwarner \
    Lsierra/DATA/RUN/ASSEMBLIES/assem15sept14/final.assembly.fasta \
    Lwarner/DATA/RUN/ASSEMBLIES/assem12oct14/final.assembly.fasta
Next I will need to process the nucmer output (the .delta file) with show-coords, show-aligns, and mummerplot, as described in section '4.2 Aligning two draft sequences' of the MUMmer documentation.
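A rough sketch of what that post-processing might look like is below. I have not run this yet. The wrapper name postprocess_nucmer is made up for illustration, the install path is assumed to be the same one used in the nucmer call above, and the delta-filter step and plotting options are my guess at reasonable settings rather than anything tuned for these data. mummerplot --png also assumes gnuplot is available on the node.

```shell
# Assumed install location (same as in the nucmer call above); override if different.
MUMMER_DIR=${MUMMER_DIR:-/home/A01963476/Source/MUMmer3.23}

# postprocess_nucmer is a hypothetical wrapper. Arguments: PREFIX REF_FASTA QRY_FASTA
postprocess_nucmer() {
    prefix=$1; ref=$2; qry=$3

    # Table of alignment coordinates: sorted by reference (-r), with percent
    # coverage (-c) and sequence lengths (-l).
    "$MUMMER_DIR/show-coords" -r -c -l "$prefix.delta" > "$prefix.coords" || return 1

    # Keep only the best hit for each query (-q) and reference (-r) position,
    # which should cut down on repeat-induced noise before plotting.
    "$MUMMER_DIR/delta-filter" -q -r "$prefix.delta" > "$prefix.filter" || return 1

    # Dot plot of the filtered alignments, with scaffolds taken from the two
    # FASTA files (-R reference, -Q query).
    "$MUMMER_DIR/mummerplot" --png --large -p "$prefix" \
        -R "$ref" -Q "$qry" "$prefix.filter" || return 1
}

# Base-level alignments are shown for one pair of sequence IDs at a time, e.g.
#   "$MUMMER_DIR/show-aligns" PREFIX.delta REF_ID QRY_ID > PREFIX.aligns
# (REF_ID and QRY_ID are placeholders for scaffold names in the two assemblies.)
```

For the run above this would be called as postprocess_nucmer mumLsierra_Lwarner followed by the two final.assembly.fasta paths, in the same order they were given to nucmer.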