DNA Alignment and Variant Calling

Post date: Jul 09, 2018 10:5:48 PM

0. Index fasta file

path: Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/parsed

bwa index cmacgenome.fasta

1. Alignment with bwa aln and samse (version 0.7.17-r1188).

Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/parsed

perl ../scripts/wrap_qsub_slurm_bwa.pl clean_*fastq

The above perl script runs bwa with the mem option enabled rather than aln. This is done due to the longer reads present.

Outputs aln_clean_*.sam files which are moved to /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/alignments

1.5 Check proportion of reads aligned

for i in *.sam; do samtools flagstat $i; done

I didn't observe any which were below 83%

2. Used samtools (1.5 with htslib 1.5) to convert sam to bam, etc.

Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/alignments

perl ../Scripts/Sam2BamFork.pl aln_clean_*sam

Edit for GATK: GATK did not run correctly, perhaps something wrong with how CHPC handles multithreading. Instead used variant calling steps from the L14 paper.

3. Variant calling, first using GATK's haplotype caller to generate individual g.vcf files.

3.1 Generate fasta file index for GATK

samtools faidx cmacgenome.fasta

3.2 Generate dictionary file for GATK

java -jar /uufs/chpc.utah.edu/sys/installdir/picard/2.1.1/picard.jar CreateSequenceDictionary R=cmacgenome.fasta O=cmacgenome.dict

From the alignments subdirectory I ran

perl ../scripts/wrap_qsub_slurm_gatk.pl *sorted.bam

4. Combine GVCF files

First, I combined the 748 g.vcf files into 4 combined g.vcf files with 196 individuals each (this speeds up variant calling).

ls *.g.vcf > vcfs.txt

sed -i 's/^/--variant /' vcfs.txt

split -l 187 vcfs.txt splitvcf

cat splitvcfaa | tr -d '\n' >> basea.txt

Edit so that it's a continuous command

cat base* > jobs.txt

sed -i -e 's/vcf/vcf /g' jobs.txt

sbatch SubComb.sh

The above shell script runs runCombineGVCFs.pl

5. Joint variant calling

3. Call Var

samtools mpileup -b bams -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/cmac_QTL_AR/alignments/bams/cmacgenome.fasta -q 20 -Q 30 -I -u -g -t DP,DPR -o calloVars.bcf

~/bin/bcftools call -v -m -p 0.01 -P 0.001 -O v -o calloVars.vcf calloVars.bcf

4. Filter Variants based on depth, mapping quality, etc; using version with data for >= 80% of individuals, keeping 28,249 SNPs. Creates filtered2x_calloVars.vcf

perl ../scripts/vcfFilter.pl calloVars.vcf

#stringency variables

my $minCoverage = 1526; # minimum number of sequences; DP, 1344 = 2X

my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC

my $notFixed = 1.0; # removes loci fixed for alt; AF

my $mq = 30; # minimum mapping quality; MQ

my $miss = 152; # maximum number of individuals with no data

my $minMaf = 7; # minimum MAF, as allele count ~ 0.005

my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation

5. Filter some more based on high SNP density. SNPs must be at least 4bp apart, plus gets rid of high coverage snps. Creates morefilter_filtered2x_calloVars.vcf

grep ^sc filtered2x_calloVars.vcf | cut -f 8 | cut -d ';' -f 1 | grep -o '[0-9]*' > depth.txt

4841+12037

grep ^sc filtered2x_calloVars.vcf | cut -f 1 > scafcal.txt

open the scaf file in vim and remove the scaf- part.

%s/scaffold_//

grep ^sc filtered2x_calloVars.vcf | cut -f 2 > poscal.txt

#create final file

paste scafBart.txt posBart.txt > snpsBart.txt

#stringency variables

my $maxCoverage = 16878; # maximum depth to avoid repeats, mean + 2sd

my $mind = 4; # don't call if SNPs are closer together than this

Retained 34284 variable loci

6. Convert to genotype likelihood format. Creates cmacL14.gl

perl ../scripts/vcf2gl.pl 0 morefilter_filtered2x_calloVars.vcf

Number of loci: 34284; number of individuals 748

Split genotype likelihoods by population

perl projects/cmac_AR/scripts/splitsPops.pl projects/cmac_AR/likelihood_data/cmacL14.gl

Create allele freq by population:

path: /uufs/chpc.utah.edu/common/home/u0795641/projects/cmac_qtl/likelihood_data

perl wrap_qsub_slurm_popmod.pl cmacseries*

paremeters:

~/bin/popmod -i $file -o outp_bayes_$pop".".hdf5 -b 1000 -n 10000 -t 5 -a 0.5

Create txt files of allele frequency (p) and genotypes (g)

module load hdf5 gsl gcc

~/bin/estpost_popmod -p g -s 0 -w 1 -o g_L1.txt outp_bayes_L1.hdf5

~/bin/estpost_popmod -p g -s 0 -w 0 -o g_L2.txt outp_bayes_L2.hdf5

~/bin/estpost_popmod -p g -s 0 -w 0 -o g_L14A.txt outp_bayes_L14A.hdf5

7. Variant calling for L14 and L1/2 references results:

L14: Number of loci: 13404; number of individuals 48 (from F16A)

L1/2/M: Number of loci: 21591; number of individuals 128

Page updated

Google Sites

Report abuse