Post date: Feb 12, 2017 3:22:40 AM
We have GBS data from a transect up a single mountain. Patrik has a student that will be running some initial geographic cline analyses with this data, with an emphasis on contrasting clinal patterns on LG8 (scaf 128) versus other scaffolds.
The flowcell with these data also includes other data for GWA that I am not working with. The data are in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/seqs/ and all of the cline samples afre in lanes iv-vii (see the barcodes subdirectory too, the ids are all 16*.
Here is what I did.
1. Parse reads and split by individual. The fastq files are in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/parsed/. There are 238 individuals.
2. Ran the alignments with wrap_qsubl_slurm_bwa.pl. This runs bwa 0.7.10-r789. Here is the main part of the script with the command line options:
## build array of jobs to be run individually (serially) by slurm
my $dir = '/uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/parsed/';
my @jobarray;
my $aln;
my $samse;
my $ind;
my $genome = "/uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta";
my $cd = "cd $dir\n";
foreach my $file (@ARGV){
if ($file =~ m/^(16_[a-z0-9_]+)/){
$ind = $1;
}
else {
die "Failed to match $file\n";
}
$aln ="bwa aln -n 5 -l 20 -k 2 -t 8 -q 10 -f aln"."$ind".".sai $genome $file\n";
$samse ="bwa samse -n 1 -r \'\@RG\\tID:tcr-"."$ind\\tPL:ILLUMINA\\tLB:tcr-"."$ind\\tSM:tcr-"."$ind"."\' -f aln"."$ind".".sam $genome aln"."$ind".".sai $file\n";
push (@jobarray, "$cd"."$aln"."$samse"."");
print "$cd"."$aln"."$samse";
}
3. Compressed, sorted and indexed the alignments with samtools 1.2 (using htslib 1.2.1). This generated the *sorted.bam files in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/alignments/.
4. Variant calling with samtools mpileup and bcftools call (bcftools version 1.3 (using htslib 1.3)). I ran variant calling via the callvar.sh submission script, here is the relevant bit:
module load samtools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/2016_gwas_trad_Patrik_clines/alignments
samtools mpileup -b bams -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta -q 20 -Q 30 -I -u -g -t DP,DPR -o tcrClineVars.bcf
bcftools call -v -m -p 0.01 -P 0.001 -O v -o tcrClineVars.vcf tcrClineVars.bcf
This generated 1,041,135 SNVs without filtering (in tcrClineVars.vcf).
5. Two rounds of variant filtering (in the variants subdirectory in data):
Key parameters in vcfFilter.pl
my $N = 238;
#### stringency variables, edits as desired
my $minCoverage = $N * 2; # minimum number of sequences; DP = 2X
my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $mq = 30; # minimum mapping quality; MQ
my $miss = 0.2 * $N; # maximum number of individuals with no data
my $minMaf = $N * 2 * 0.01; # minimum MAF, as allele count ~ 0.01
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
perl vcfFilter.pl tcrClineVars.vcf
233,744 loci were retained in filtered2x_tcrClineVars.vcf.
#### stringency variables, edits as desired
my $maxCoverage = 3329; # maximum depth to avoid repeats, mean + 3sd
my $mind = 4; # don't call if SNPs are closer together than this
##### this set is for whole genome shotgun data
perl filterSomeMore.pl filtered2x_tcrClineVars.vcf
Retained 169,429 SNVs in morefilter_filtered2x_tcrClineVars.vcf.
Convert to gl, retain only common (maf > 5%) SNPs.
perl vcf2gl.pl 0.05 morefilter_filtered2x_tcrClineVars.vcf
The file tcrCline_commonvars.gl contains 66,904 common SNPs.
6. Split populations and estimate allele frequencies.