Post date: Nov 07, 2013 5:28:6 PM
I split the parsed melissa sequences into individual fastq files. These are in data/lycaeides/lycaeides_gbs/Parsed_Melissa/IndSeqFiles/. I also made links to the relevant individual sequence files from the admixture project. We have melissa form 26 populations plus the selection experiment bugs, with 1573 individuals total. The script for splitting fastq files is in the lycaeides_gbs Scripts directory.
I will use bwa to align sequences to the melissa draft genome reference. I first made a symbolic link to the gnome (final.assembly.fasta) here: data/lycaeides/lycaeides_gbs/Assemblies/melissaGbs2genome.
I then used the bwa 'is' algorithm to index the genome. Here is the command (note bwa is loaded on dorc through the dotkit, use BWA or resuse BWA).
bwa index -a is final.assembly.fasta
The 'is' algorithm is good for smaller genomes, less than 2 Gb (not sure of this is bases or bytes, but either way the draft melissa genome is smaller than this).
Now I am using bwa aln and bwa samse to map the sequences to the genome. I am using bwa version 0.7.5a-r405. Here are the exact command line options and defaults for this version:
Usage: bwa aln [options] <prefix> <in.fq>
Options: -n NUM max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
-o INT maximum number or fraction of gap opens [1]
-e INT maximum number of gap extensions, -1 for disabling long gaps [-1]
-i INT do not put an indel within INT bp towards the ends [5]
-d INT maximum occurrences for extending a long deletion [10]
-l INT seed length [32]
-k INT maximum differences in the seed [2]
-m INT maximum entries in the queue [2000000]
-t INT number of threads [1]
-M INT mismatch penalty [3]
-O INT gap open penalty [11]
-E INT gap extension penalty [4]
-R INT stop searching when there are >INT equally best hits [30]
-q INT quality threshold for read trimming down to 35bp [0]
-f FILE file to write output to instead of stdout
-B INT length of barcode
-L log-scaled gap penalty for long deletions
-N non-iterative mode: search for all n-difference hits (slooow)
-I the input is in the Illumina 1.3+ FASTQ-like format
-b the input read file is in the BAM format
-0 use single-end reads only (effective with -b)
-1 use the 1st read in a pair (effective with -b)
-2 use the 2nd read in a pair (effective with -b)
-Y filter Casava-filtered sequences
Usage: bwa samse [-n max_occ] [-f out.sam] [-r RG_line] <prefix> <in.sai> <in.fq>
I am running bwa aln and bwa samse through a batch submission to the dorc cluster (wrap_qsub_rc_bwa.pl in the lycaeides_gbs Scripts directory). Here is an example of the options I used:
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnABC12-03.sai final.assembly.fasta /home/A01963476/data/lycaeides/lycaeides_gbs/Parsed_Melissa/IndSeqFiles/ABC12-03.fastq
bwa samse -n 1 -r '@RG ID:lycaeides-ABC12-03' -f alnABC12-03.sam final.assembly.fasta alnABC12-03.sai /home/A01963476/data/lycaeides/lycaeides_gbs/Parsed_Melissa/IndSeqFiles/ABC12-03.fastq