alignment, variant calling, and allele frequencies for 2016 cline samples

Post date: Feb 12, 2017 3:22:40 AM

We have GBS data from a transect up a single mountain. Patrik has a student that will be running some initial geographic cline analyses with this data, with an emphasis on contrasting clinal patterns on LG8 (scaf 128) versus other scaffolds.

The flowcell with these data also includes other data for GWA that I am not working with. The data are in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/seqs/ and all of the cline samples afre in lanes iv-vii (see the barcodes subdirectory too, the ids are all 16*.

Here is what I did.

1. Parse reads and split by individual. The fastq files are in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/parsed/. There are 238 individuals.

2. Ran the alignments with wrap_qsubl_slurm_bwa.pl. This runs bwa 0.7.10-r789. Here is the main part of the script with the command line options:

## build array of jobs to be run individually (serially) by slurm

my $dir = '/uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/parsed/';

my @jobarray;

my $aln;

my $samse;

my $ind;

my $genome = "/uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta";

my $cd = "cd $dir\n";

foreach my $file (@ARGV){

if ($file =~ m/^(16_[a-z0-9_]+)/){

$ind = $1;

}

else {

die "Failed to match $file\n";

}

$aln ="bwa aln -n 5 -l 20 -k 2 -t 8 -q 10 -f aln"."$ind".".sai $genome $file\n";

$samse ="bwa samse -n 1 -r \'\@RG\\tID:tcr-"."$ind\\tPL:ILLUMINA\\tLB:tcr-"."$ind\\tSM:tcr-"."$ind"."\' -f aln"."$ind".".sam $genome aln"."$ind".".sai $file\n";

push (@jobarray, "$cd"."$aln"."$samse"."");

print "$cd"."$aln"."$samse";

}

3. Compressed, sorted and indexed the alignments with samtools 1.2 (using htslib 1.2.1). This generated the *sorted.bam files in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/alignments/.

4. Variant calling with samtools mpileup and bcftools call (bcftools version 1.3 (using htslib 1.3)). I ran variant calling via the callvar.sh submission script, here is the relevant bit:

module load samtools

cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/alignments

samtools mpileup -b bams -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta -q 20 -Q 30 -I -u -g -t DP,DPR -o tcrClineVars.bcf

bcftools call -v -m -p 0.01 -P 0.001 -O v -o tcrClineVars.vcf tcrClineVars.bcf

This generated 1,041,135 SNVs without filtering (in tcrClineVars.vcf).

5. Two rounds of variant filtering (in the variants subdirectory in data):

Key parameters in vcfFilter.pl

my $N = 238;

#### stringency variables, edits as desired

my $minCoverage = $N * 2; # minimum number of sequences; DP = 2X

my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC

my $notFixed = 1.0; # removes loci fixed for alt; AF

my $mq = 30; # minimum mapping quality; MQ

my $miss = 0.2 * $N; # maximum number of individuals with no data

my $minMaf = $N * 2 * 0.01; # minimum MAF, as allele count ~ 0.01

my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation

perl vcfFilter.pl tcrClineVars.vcf

233,744 loci were retained in filtered2x_tcrClineVars.vcf.

#### stringency variables, edits as desired

my $maxCoverage = 3329; # maximum depth to avoid repeats, mean + 3sd

my $mind = 4; # don't call if SNPs are closer together than this

##### this set is for whole genome shotgun data

perl filterSomeMore.pl filtered2x_tcrClineVars.vcf

Retained 169,429 SNVs in morefilter_filtered2x_tcrClineVars.vcf.

Convert to gl, retain only common (maf > 5%) SNPs.

perl vcf2gl.pl 0.05 morefilter_filtered2x_tcrClineVars.vcf

The file tcrCline_commonvars.gl contains 66,904 common SNPs.

6. Split populations and estimate allele frequencies.

Page updated

Google Sites

Report abuse