These are the notes for variant calling from Zach's notes. These are the details for the file used for the BayPass analysis.
I am using samtools and bcftools to identify variable sites in the melissa sequence data. I am not sure whether I want to use all 1573 individuals or just wild-caught individuals (most individuals were in the selection experiment and come from just two populations). Also, some of the individuals with 'SELEXP' ids are actually wild-caught (this includes more individuals than were sampled from the other natural populations). I will first contrast two variant calling runs, one with all individuals and one without 'SELEXP' ids. The specific commands are below. Note I also added the -C 50 option to downgrade the quality scores of reads with many mismatches (the value 50 is recommended for alignments with bwa), and I added the -d 50 option to only use 50 reads per individual and locus. Otherwise these options mainly match what I have used before.
wild-caught only, no 'SELEXP'
samtools mpileup -C 50 -d 50 -f final.assembly.fasta -q 10 -Q 15 -D -g -I aln[A-Z][A-Z][A-Z][0-9]*bam aln[A-Z][A-Z][0-9]*bam > varMelissaWild.bcf
bcftools view -c -d 0.8 -g -p 0.01 -P full -t 0.001 -v varMelissaWild.bcf > varMelissaWild.vcf
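As a sanity check on those two globs: they pick up bam files whose ids start with a two- or three-letter population code followed by digits, and skip 'SELEXP' ids, because the character right after the code must be a digit. A small sketch with made-up file names (the aln* names below are illustrations, not real samples):

```shell
# Mock bam file names (made up for illustration): two wild-caught-style
# ids and one SELEXP id.
touch alnABC001.bam alnBY002.bam alnSELEXP003.bam
# The mpileup globs match only the wild-caught-style names; alnSELEXP003.bam
# fails both patterns because 'L'/'E' sits where a digit is required.
ls aln[A-Z][A-Z][A-Z][0-9]*bam aln[A-Z][A-Z][0-9]*bam
```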
all individuals
samtools mpileup -C 50 -d 50 -f final.assembly.fasta -q 10 -Q 15 -D -g -I *bam > varMelissaAll.bcf
bcftools view -c -d 0.8 -g -p 0.01 -P full -t 0.001 -v varMelissaAll.bcf > varMelissaAll.vcf
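To contrast the two runs, a quick way to count how many variant sites each produced is to count the non-header lines of each output vcf (varMelissaWild.vcf and varMelissaAll.vcf). A sketch on a tiny mock vcf standing in for the real files:

```shell
# Count variant records (lines not starting with '#') in a VCF.
# mock.vcf is a made-up stand-in; run the same grep on the real files.
cat > mock.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
scaffold_1 101 . A G 99 . .
scaffold_1 250 . C T 80 . .
scaffold_2 33 . G A 57 . .
EOF
grep -vc '^#' mock.vcf
```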
I used samtools v 0.1.19-44428cd, which has these command options for mpileup,
Usage: samtools mpileup [options] in1.bam [in2.bam [...]]
Input options:
-6 assume the quality is in the Illumina-1.3+ encoding
-A count anomalous read pairs
-B disable BAQ computation
-b FILE list of input BAM filenames, one per line [null]
-C INT parameter for adjusting mapQ; 0 to disable [0]
-d INT max per-BAM depth to avoid excessive memory usage [250]
-E recalculate extended BAQ on the fly thus ignoring existing BQs
-f FILE faidx indexed reference sequence file [null]
-G FILE exclude read groups listed in FILE [null]
-l FILE list of positions (chr pos) or regions (BED) [null]
-M INT cap mapping quality at INT [60]
-r STR region in which pileup is generated [null]
-R ignore RG tags
-q INT skip alignments with mapQ smaller than INT [0]
-Q INT skip bases with baseQ/BAQ smaller than INT [13]
--rf INT required flags: skip reads with mask bits unset []
--ff INT filter flags: skip reads with mask bits set []
Output options:
-D output per-sample DP in BCF (require -g/-u)
-g generate BCF output (genotype likelihoods)
-O output base positions on reads (disabled by -g/-u)
-s output mapping quality (disabled by -g/-u)
-S output per-sample strand bias P-value in BCF (require -g/-u)
-u generate uncompress BCF output
SNP/INDEL genotype likelihoods options (effective with `-g' or `-u'):
-e INT Phred-scaled gap extension seq error probability [20]
-F FLOAT minimum fraction of gapped reads for candidates [0.002]
-h INT coefficient for homopolymer errors [100]
-I do not perform indel calling
-L INT max per-sample depth for INDEL calling [250]
-m INT minimum gapped reads for indel candidates [1]
-o INT Phred-scaled gap open sequencing error probability [40]
-p apply -m and -F per-sample to increase sensitivity
-P STR comma separated list of platforms for indels [all]
Notes: Assuming diploid individuals.
And I used bcftools view version 0.1.19-44428cd, which has these options,
Usage: bcftools view [options] <in.bcf> [reg]
Input/output options:
-A keep all possible alternate alleles at variant sites
-b output BCF instead of VCF
-D FILE sequence dictionary for VCF->BCF conversion [null]
-F PL generated by r921 or before (which generate old ordering)
-G suppress all individual genotype information
-l FILE list of sites (chr pos) or regions (BED) to output [all sites]
-L calculate LD for adjacent sites
-N skip sites where REF is not A/C/G/T
-Q output the QCALL likelihood format
-s FILE list of samples to use [all samples]
-S input is VCF
-u uncompressed BCF output (force -b)
Consensus/variant calling options:
-c SNP calling (force -e)
-d FLOAT skip loci where less than FLOAT fraction of samples covered [0]
-e likelihood based analyses
-g call genotypes at variant sites (force -c)
-i FLOAT indel-to-substitution ratio [-1]
-I skip indels
-m FLOAT alternative model for multiallelic and rare-variant calling, include if P(chi^2)>=FLOAT
-p FLOAT variant if P(ref|D)<FLOAT [0.5]
-P STR type of prior: full, cond2, flat [full]
-t FLOAT scaled substitution mutation rate [0.001]
-T STR constrained calling; STR can be: pair, trioauto, trioxd and trioxs (see manual) [null]
-v output potential variant sites only (force -c)
Contrast calling and association test options:
-1 INT number of group-1 samples [0]
-C FLOAT posterior constrast for LRT<FLOAT and P(ref|D)<0.5 [1]
-U INT number of permutations for association testing (effective with -1) [0]
-X FLOAT only perform permutations for P(chi^2)<FLOAT [0.01]
These are my notes for variant calling. I will be using this file for the TreeMix analysis. As I am rerunning this analysis, I need to update details about the final files and the preparation of the final vcf file for the analysis.
RAN SAMtools ON THE L.MELISSA DATA.
STEP 1
The locuslist.txt file has 565 lines, one per locus. The bam files include the L.melissa populations and the L.ana populations, which we are using as an outgroup to root the tree in TreeMix. The list of bam files is saved in bamList.txt, and the list of loci for these populations is saved in locuslist.txt.
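The line counts of the two lists can be double-checked with wc -l (the real locuslist.txt should show 565). A sketch using mock stand-ins so as not to touch the real files; the mock contents are made up, with loci in the "chr pos" form that mpileup -l accepts:

```shell
# Count the lines in the input lists; the real locuslist.txt has 565.
printf 'scaffold_1 100\nscaffold_1 200\nscaffold_2 50\n' > locuslist_mock.txt
printf 'alnABC001.bam\nalnBSY002.bam\n' > bamList_mock.txt
wc -l < locuslist_mock.txt
wc -l < bamList_mock.txt
```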
Ran the samtools mpileup on the final alignment file on the cluster with following steps:
a).Created the following script in nano:
command used: nano runmelissaVariants.sh
script saved as runmelissaVariants.sh
#!/bin/sh
#PBS -N samtools
#PBS -l nodes=1:ppn=1
#PBS -l walltime=96:00:00
#PBS -l mem=24g
#PBS -q batch
. /rc/tools/utils/dkinit
reuse SAMtools
cd /labs/evolution/projects/Lmelissa_host_adaptation/variants/bams
samtools mpileup -C 50 -d 50 -l locuslist.txt -f ../../final.assembly.fasta -q 10 -Q 15 -D -g -I aln*.bam > melissaVariants.bcf
b). submit the job to the DoRC with the following command:
qsub runmelissaVariants.sh
c). To check the status of the job, use the following command:
qstat 102598
or
qstat -u A02079361
The job number submitted to the cluster is 102598. The output will be saved as slurm102598.out
STEP 2
Submitted two scripts to the DoRC with the following commands in SAMtools, saved in shell scripts called runmelissaVariants1.sh (with -d 0.8) and runmelissaVariants2.sh (without -d)
#!/bin/sh
#PBS -N samtools
#PBS -l nodes=1:ppn=1
#PBS -l walltime=96:00:00
#PBS -l mem=24g
#PBS -q batch
. /rc/tools/utils/dkinit
reuse SAMtools
cd /labs/evolution/projects/Lmelissa_host_adaptation/variants/bams
bcftools view -c -d 0.8 -g -p 0.01 -P full -t 0.001 -v -l ../locuslist.txt melissaVariants.bcf > melissaVariants1.vcf
-v option: output potential variant sites only
The analysis gave weird results with the -d option, so we removed the -d option and reran. melissaVariants2.vcf is the working file and will be used for further analysis.
My final input script was the following:
#!/bin/sh
#PBS -N samtools
#PBS -l nodes=1:ppn=1
#PBS -l walltime=96:00:00
#PBS -l mem=24g
#PBS -q batch
. /rc/tools/utils/dkinit
reuse SAMtools
cd /labs/evolution/projects/Lmelissa_host_adaptation/variants/bams
bcftools view -c -g -p 0.01 -P full -t 0.001 -v -l ../locuslist.txt melissaVariants.bcf > melissaVariants2.vcf
The script is saved in the directory /labs/evolution/projects/Lmelissa_hostplantadaptation
melissaVariants2.vcf is in the bams folder, inside the variants folder, in Lmelissa_hostplant
To count the number of scaffold records from the given locuslist in the final vcf, I used the following command:
grep -c "scaf" path/melissaVariants2.vcf
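Beyond a total count, the records can be tallied per scaffold by cutting the first column of the non-header lines. A sketch on a mock vcf (made-up contents; in a real vcf the fields are tab-separated, which awk also handles):

```shell
# Count variant records per scaffold (mock stand-in for melissaVariants2.vcf).
cat > melissaVariants2_mock.vcf <<'EOF'
##fileformat=VCFv4.1
#CHROM POS ID REF ALT
scaffold_1 10 . A G
scaffold_1 42 . C T
scaffold_2 7 . G A
EOF
grep -v '^#' melissaVariants2_mock.vcf | awk '{print $1}' | sort | uniq -c
```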
STEP 3
Now I converted the vcf file Lmelissa_variants.vcf to a gl file using the perl script vcf2gl.pl and got the outfile LmelissaVariants.gl
perl ../scripts/vcf2gl.pl Lmelissa_variants.vcf
Next I split the populations in the previous file using the perl script splitPops.pl and received population files like LmpopYBG, etc.
perl ../scripts/splitPops.pl LmelissaVariants.gl
STEP 4
ZG did the following:
Go to the directory labs/evolution/scratch/melissa, which has all the hdf5 files for the two mcmc runs, 0 and 1.
./estpost_popmod
#assign interactive job on DoRC
salloc --nodes=1 --ntasks-per-node=12 -p interactive
#running the estpost_popmod
./estpost_popmod -o p_ABC.txt -p p -s 0 -w 0 Lmpop_ABCCh*hdf5
./estpost_popmod -o g_ABC.txt -p g -s 0 -w 0 Lmpop_ABCCh*hdf5
SUBMITTING THE PERL SCRIPT calcG.pl to do the above task for all populations at once
chmod a+x calcG.pl
ls -lh calcG.pl
./calcG.pl Lmpop_*0.hdf5
#check these files in HDFView.
HDFView will show 2 datasets: g and p
1). g (or genotype: shows the posterior probability for that genotype in the population)
This will generate 3 files: 0, 1, 2
0 is for reference homozygous (AA)
1 is for heterozygous (AB)
2 is for alternate homozygous (BB)
Each row is a locus; in this file there are 184287 loci. The columns are the posterior probabilities of that genotype for each individual. In this population there are 19 individuals.
-If the posterior probability is 1, then the genotype is taken as true.
-If the value is a decimal, then the estpost_popmod script/program will calculate the weighted average of the values in the three files 0, 1, 2 for that individual at that locus.
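That weighted average is just the expected genotype, 0*P(AA) + 1*P(AB) + 2*P(BB). A one-line awk sketch with made-up posterior probabilities for one individual at one locus:

```shell
# Genotype point estimate as the weighted average of the three genotype
# posterior probabilities (files 0, 1, 2). The probabilities below are
# invented for illustration: p(AA)=0.1, p(AB)=0.7, p(BB)=0.2.
awk 'BEGIN { p0 = 0.1; p1 = 0.7; p2 = 0.2; print 0*p0 + 1*p1 + 2*p2 }'
```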
2). p (or population allele frequencies)
Each row is a locus.
Each column is the allele frequency sampled at a step of the mcmc chain; there are 2000 in this file because we ran 2 mcmc chains.
Go to R
G<-matrix(scan("g_ABC.txt",n=184287*19,sep=","),nrow=19,ncol=184287,byrow=T)
#Note that 19 is the population size for this population and will vary for all populations.
dim(G)
P<-round(apply(G,2,sum),0) # sum genotype estimates over individuals, rounded to whole allele counts
#38 will change according to population size: it is population size*2, since individuals are diploid and each contributes two gene copies.
PP<-cbind(P,38-P)
write.table(PP,"pp_ABC.txt",sep=",",row.names=F,col.names=F)
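The same column-sum logic as the R step above can be sketched in awk on a tiny mock g matrix (3 individuals by 2 loci, comma-separated like g_ABC.txt; the numbers are made up). Each column sum is the alternate-allele count P, and 2N - P is the reference count:

```shell
# Mock genotype point estimates: rows = individuals, columns = loci.
printf '1,2\n0,1\n2,2\n' > g_mock.txt
# Column sums give P per locus; NR is the number of individuals N,
# so the reference-allele count is 2*N - P.
awk -F, '{ for (i = 1; i <= NF; i++) s[i] += $i; n = NF }
         END { for (i = 1; i <= n; i++) { ref = 2*NR - s[i]; print s[i] "," ref } }' g_mock.txt
```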
After creating all the p and g files for all the populations, I created the input file for TreeMix analysis.
Go to R
P1<-read.table("p_ABC.txt")
P2<-read.table("p_BSY.txt")
Lmelissa_inputfile<-cbind(P1,P2) # bind the two population columns side by side (merge would join on shared values)
colnames(Lmelissa_inputfile)<-c("ABC","BSY")
write.table(Lmelissa_inputfile,"Lmelissa_inputfile",col.names=T,row.names=F,quote=F)
The input file for Treemix is saved as Lmelissa_inputfile in the directory projects/Lmelissa_host_adaptation/Lmelissa_treemix
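For reference, the same side-by-side join can be done outside R with paste. TreeMix expects a header row of population names followed by one space-separated "count,count" pair per population per locus; a sketch assuming the per-population count files (the pp_* files written above) are the intended input, with made-up counts:

```shell
# Mock per-population allele-count files (stand-ins for pp_ABC.txt / pp_BSY.txt).
printf '3,3\n5,1\n' > pp_ABC.txt
printf '2,4\n6,0\n' > pp_BSY.txt
# Header of population names, then the count columns pasted side by side.
{ echo 'ABC BSY'; paste -d' ' pp_ABC.txt pp_BSY.txt; } > Lmelissa_inputfile_mock
cat Lmelissa_inputfile_mock
```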