Post date: Aug 27, 2017 7:19:42 PM
Previously I processed data for Patrik's student Freddie for cline analyses on Refugio mountain (described here). Freddie did everything post variant calling/genotype inference. The results turned out to be pretty cool: the portion of LG8 showing reduced introgression on Refugio was (a bit) narrower than the region associated with stripe based on GWA in non-Refugio populations (it was closer to the color region). This suggested that the genetic architecture of the trait might actually differ between mountains. Freddie mapped the trait on Refugio and found the narrower region. Now he will run clines with the non-Refugio data, with the prediction that a wider region will show reduced introgression.
I am copying the data from Victor. They will be in king:~/data/timema/clines/hyw154/. Here is Victor's description of what he did (I am working with 03_split as he suggested).
I am sorry this took a bit longer than expected. I will quickly update Zach, as I don't think he knows about the issues we have had with this library. This is the first time we have done paired-end GBS with an updated protocol. While I was demultiplexing the reads, I found that a rather high proportion of reads didn't have barcodes (~20%). Clarissa and Freddie did some troubleshooting to understand what happened. They found that a few barcodes consistently failed (causing ~10% of samples to fail), which explains the reduced demultiplexing rate pretty well. Fortunately, the affected samples do not seem to be crucial ones (you can easily deduce which ones they are by checking the report file, see below). I then decided to check the data in a bit more detail and found that a lot of reads (~50%) have adapters, PCR dimers, PCR primers, or some contamination. Clarissa thinks this could be because they used a slightly modified protocol with more PCR cycles (correct me if this is wrong). Anyhow, I therefore decided to clean and filter the sequences with bbduk before demultiplexing. I also mapped them with bowtie2 to the dovetail 1.3c2 reference genome to see how things look. After all the filtering, the alignment rate is a bit lower than what we used to obtain before, with an average of ~90% of reads mapped and a few individuals with very low alignment rates, as low as 12% for 16_p0294_3 and a few others with <60% (16_p0293_7, 16_p0085_4, 16_p0292_1, 16_p0299_9). That being said, the final absolute numbers don't look bad at all when compared to those of previous libraries. For example, the last single-read GBS library (the one that included Patrik's clines) had an average of ~500k mapped reads per sample (~630k if only cristinae samples are considered, 83 bp per read on average), whereas the average per sample for this library is 2 x ~1M reads (71 bp per read on average, and that is counting all the samples that failed). All in all, I would say it is a good start with the new protocol.
You can find all the files in diogenes:/fastdata/bo1vsx/Freddie_clines. The raw reads are in 00_raw, filtered reads in 01_flt, parsed and split by sample in 03_split (Zach - these are the ones you may want to use), and the mapped reads in 04_map. You can find a summary report with number of reads and bps filtered and mapped in summary.report.dsv. Zach - if you prefer the unfiltered ones, you can find them in unflt/02_split.
Here is what I did (working in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/clines/hyw154).
1. Paired-end alignment with bwa aln and sampe. Note that I matched the conditions used for Refugio, except where paired-end alignment has unique options. I used bwa version 0.7.10-r789. From within the reads1 directory I ran,
perl ../scripts/wrap_qsub_slurm_bwa.pl *bz2
Here is an example for one individual:
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/clines/hyw154/
bzip2 -dk reads1/16_p0453_4.fq.bz2
bwa aln -n 5 -l 20 -k 2 -t 8 -q 10 -f reads1/aln_1_16_p0453_4.sai /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta reads1/16_p0453_4.fq
bzip2 -dk reads2/16_p0453_4.fq.bz2
bwa aln -n 5 -l 20 -k 2 -t 8 -q 10 -f reads2/aln_2_16_p0453_4.sai /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta reads2/16_p0453_4.fq
bwa sampe -n 1 -N 1 -r '@RG\tID:tcr-16_p0453_4\tPL:ILLUMINA\tLB:tcr-16_p0453_4\tSM:tcr-16_p0453_4' -a 500 -f align/aln16_p0453_4.sam /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta reads1/aln_1_16_p0453_4.sai reads2/aln_2_16_p0453_4.sai reads1/16_p0453_4.fq reads2/16_p0453_4.fq
And here are the defaults/usage for sampe,
Usage: bwa sampe [options] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq>
Options: -a INT maximum insert size [500]
-o INT maximum occurrences for one end [100000]
-n INT maximum hits to output for paired reads [3]
-N INT maximum hits to output for discordant pairs [10]
-c FLOAT prior of chimeric rate (lower bound) [1.0e-05]
-f FILE sam file to output results to [stdout]
-r STR read group header line such as `@RG\tID:foo\tSM:bar' [null]
-P preload index into memory (for base-space reads only)
-s disable Smith-Waterman for the unmapped mate
-A disable insert size estimate (force -s)
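The wrapper script itself is not shown above. As a rough sketch (not the actual wrap_qsub_slurm_bwa.pl; the SLURM header values and job-file names here are assumptions), it writes and submits one job per individual along the lines of:
#!/usr/bin/perl
# rough sketch of what wrap_qsub_slurm_bwa.pl does: write and submit one SLURM job per
# individual that decompresses both read files, runs bwa aln on each, and then bwa sampe;
# SLURM header values are illustrative, not the ones actually used
use warnings;
use strict;

my $dir = "/uufs/chpc.utah.edu/common/home/u6000989/data/timema/clines/hyw154";
my $genome = "/uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta";

foreach my $bz2 (@ARGV){ ## e.g. 16_p0453_4.fq.bz2
    my $ind = $bz2;
    $ind =~ s/\.fq\.bz2$// or die "unexpected file name: $bz2\n";
    open(my $job, '>', "bwa_$ind.sh") or die "failed to write job file for $ind\n";
    print $job "#!/bin/sh
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=bwa$ind
cd $dir
bzip2 -dk reads1/$ind.fq.bz2
bwa aln -n 5 -l 20 -k 2 -t 8 -q 10 -f reads1/aln_1_$ind.sai $genome reads1/$ind.fq
bzip2 -dk reads2/$ind.fq.bz2
bwa aln -n 5 -l 20 -k 2 -t 8 -q 10 -f reads2/aln_2_$ind.sai $genome reads2/$ind.fq
bwa sampe -n 1 -N 1 -r '\@RG\\tID:tcr-$ind\\tPL:ILLUMINA\\tLB:tcr-$ind\\tSM:tcr-$ind' -a 500 -f align/aln$ind.sam $genome reads1/aln_1_$ind.sai reads2/aln_2_$ind.sai reads1/$ind.fq reads2/$ind.fq
";
    close($job);
    system "sbatch bwa_$ind.sh\n";
}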
2. Compress, sort, and index the alignments. This now uses samtools 1.5, which must have been updated on the cluster. Also note that I am using a new approach to running parallel jobs with perl (via Parallel::ForkManager). From the alignments subdirectory I ran,
perl ../scripts/wrap_qsub_slurm_sam2bam-par.pl aln1*sam
Here is one of the five jobs:
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/clines/hyw154/align
perl /uufs/chpc.utah.edu/common/home/u6000989/data/timema/clines/hyw154/scripts/forkSam2Bam.pl 12 aln16_p0114_2.sam aln16_p0115.sam aln16_p0126_10.sam aln16_p0126_11.sam aln16_p0126_12.sam aln16_p0126_13.sam aln16_p0126_14.sam aln16_p0126_15.sam aln16_p0126_16.sam aln16_p0126_1.sam aln16_p0126_2.sam aln16_p0126_3.sam aln16_p0126_4.sam aln16_p0126_5.sam aln16_p0126_6.sam aln16_p0126_7.sam aln16_p0126_8.sam aln16_p0126_9.sam aln16_p0127_1.sam aln16_p0127_2.sam aln16_p0127_3.sam aln16_p0127_4.sam aln16_p0127_5.sam aln16_p0127_6.sam aln16_p0127_7.sam aln16_p0129_10.sam aln16_p0129_11.sam aln16_p0129_1.sam aln16_p0129_2.sam aln16_p0129_3.sam aln16_p0129_4.sam aln16_p0129_5.sam aln16_p0129_6.sam aln16_p0129_7.sam aln16_p0129_8.sam aln16_p0129_9.sam aln16_p0130_10.sam aln16_p0130_11.sam aln16_p0130_1.sam aln16_p0130_2.sam aln16_p0130_3.sam aln16_p0130_4.sam aln16_p0130_5.sam aln16_p0130_6.sam aln16_p0130_7.sam aln16_p0130_8.sam aln16_p0130_9.sam aln16_p0131_10.sam aln16_p0131_1.sam aln16_p0131_2.sam aln16_p0131_3.sam aln16_p0131_4.sam aln16_p0131_5.sam aln16_p0131_6.sam aln16_p0131_7.sam aln16_p0131_8.sam aln16_p0131_9.sam aln16_p0132_10.sam aln16_p0132_11.sam aln16_p0132_12.sam aln16_p0132_13.sam aln16_p0132_1.sam aln16_p0132_2.sam aln16_p0132_3.sam aln16_p0132_4.sam aln16_p0132_5.sam aln16_p0132_6.sam aln16_p0132_7.sam aln16_p0132_8.sam aln16_p0132_9.sam aln16_p0133_10.sam aln16_p0133_11.sam aln16_p0133_12.sam aln16_p0133_13.sam aln16_p0133_1.sam aln16_p0133_2.sam aln16_p0133_3.sam aln16_p0133_4.sam aln16_p0133_5.sam aln16_p0133_6.sam aln16_p0133_7.sam aln16_p0133_8.sam aln16_p0133_9.sam aln16_p0134_10.sam aln16_p0134_11.sam aln16_p0134_1.sam aln16_p0134_2.sam aln16_p0134_3.sam aln16_p0134_4.sam aln16_p0134_5.sam aln16_p0134_6.sam aln16_p0134_7.sam aln16_p0134_8.sam aln16_p0134_9.sam aln16_p0161_1.sam aln16_p0161_2.sam aln16_p0161_3.sam aln16_p0161_4.sam aln16_p0161_5.sam aln16_p0162_10.sam
And here is the script that gets called, forkSam2Bam.pl:
#!/usr/bin/perl
#
# converts sam 2 bam, uses fork
#
use warnings;
use strict;
use Parallel::ForkManager;
my $max = shift(@ARGV); ## get number of cores to use at one time
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach my $file (@ARGV){ ## loop through the set of sam files
    $pm->start and next FILES; ## fork a child process for this file
    my $out = $file;
    $out =~ s/sam// or die "failed sub for $file\n";
    system "samtools view -b -S -o $out"."bam $file\n"; ## sam -> bam
    system "samtools sort -o $out"."sorted.bam $out"."bam\n"; ## sort (samtools >= 1.3 syntax)
    system "samtools index $out"."sorted.bam\n"; ## index the sorted bam
    $pm->finish; ## exit the child process
}
$pm->wait_all_children;
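The wrapper wrap_qsub_slurm_sam2bam-par.pl is not reproduced here either. Roughly, and under the assumption that it simply splits the sam files evenly across the five SLURM jobs, each running forkSam2Bam.pl with 12 concurrent forks (header values again illustrative), it does something like:
#!/usr/bin/perl
# rough sketch of wrap_qsub_slurm_sam2bam-par.pl: chunk the sam files into five SLURM jobs,
# each of which calls forkSam2Bam.pl with 12 cores on its share of the files
use warnings;
use strict;
use POSIX qw(ceil);

my $njobs = 5;
my $ncores = 12;
my $dir = "/uufs/chpc.utah.edu/common/home/u6000989/data/timema/clines/hyw154/align";
my $script = "/uufs/chpc.utah.edu/common/home/u6000989/data/timema/clines/hyw154/scripts/forkSam2Bam.pl";

my @sams = @ARGV;
my $per = ceil(scalar(@sams) / $njobs); ## files per job
for my $j (1..$njobs){
    my @chunk = splice(@sams, 0, $per);
    last unless @chunk;
    my $files = join(" ", @chunk);
    open(my $job, '>', "sam2bam_$j.sh") or die "failed to write job $j\n";
    print $job "#!/bin/sh
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=$ncores
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=sam2bam$j
cd $dir
perl $script $ncores $files
";
    close($job);
    system "sbatch sam2bam_$j.sh\n";
}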
3. Variant calling with the bcftools multiallelic/rare-variant caller. Note that I am using samtools 1.5 and bcftools 1.3. From the align sub-directory, I ran:
sbatch callvar.sh
which executes:
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=samtools
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
module load samtools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/clines/hyw154/align
samtools mpileup -b bams -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta -q 20 -Q 30 -I -u -g -t DP,AD,ADF,ADR -o tcrHwy154Vars.bcf
bcftools call -v -m -p 0.01 -P 0.001 -O v -o tcrHwy154Vars.vcf tcrHwy154Vars.bcf
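Note that mpileup reads its list of alignment files from the file bams (-b bams), which has to exist before the job is submitted. A minimal way to build that list, assuming the aln*.sorted.bam naming from step 2 (this helper is an assumption, not part of callvar.sh):
#!/usr/bin/perl
# write the list of sorted, indexed BAMs to the file 'bams' for samtools mpileup -b
use warnings;
use strict;

open(my $out, '>', 'bams') or die "failed to write bams\n";
foreach my $bam (glob("aln*.sorted.bam")){
    print $out "$bam\n";
}
close($out);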
4. Variant filtering, from the variants sub-directory.
We had 2,361,165 unfiltered variants.
Key parameters in vcfFilter.pl
my $N = 472;
#### stringency variables, edits as desired
my $minCoverage = $N * 2; # minimum number of sequences; DP = 2X
my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $mq = 30; # minimum mapping quality; MQ
my $miss = 0.2 * $N; # maximum number of individuals with no data
my $minMaf = $N * 2 * 0.01; # minimum MAF, as allele count ~ 0.01
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
perl ../scripts/vcfFilter.pl tcrHwy154Vars.vcf
Retained 58658 variable loci in filtered2x_tcrHwy154Vars.vcf
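The full vcfFilter.pl is not reproduced here, but as a rough sketch of the kind of per-site INFO-field checks the parameters above drive (field names follow the bcftools output; the real script also applies the missing-data, alternative-allele-read, fixed-allele, and strand-ratio checks listed above):
#!/usr/bin/perl
# rough sketch of per-site filtering keyed to the parameters above; not the actual vcfFilter.pl
use warnings;
use strict;

my $N = 472;
my $minCoverage = $N * 2;    ## minimum total depth, DP
my $mq = 30;                 ## minimum mapping quality, MQ
my $minMaf = $N * 2 * 0.01;  ## minimum minor allele count

while(<>){
    if(m/^#/){ print; next; } ## pass VCF header lines through
    my ($dp) = m/DP=(\d+)/;
    my ($mapq) = m/MQ=([\d.]+)/;
    my ($ac) = m/AC=(\d+)/;
    my ($an) = m/AN=(\d+)/;
    next unless (defined($dp) && $dp >= $minCoverage);
    next unless (defined($mapq) && $mapq >= $mq);
    if(defined($ac) && defined($an)){ ## minor allele count from AC/AN
        my $minor = ($ac < ($an - $ac)) ? $ac : ($an - $ac);
        next if $minor < $minMaf;
    }
    print; ## site passed, keep it
}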
#### stringency variables, edits as desired
my $maxCoverage = 6485; # maximum depth to avoid repeats, mean + 3sd = 1885.44 + 3 (1533.176)
my $mind = 4; # don't call if SNPs are closer together than this
##### this set is for whole genome shotgun data
perl filterSomeMore.pl filtered2x_tcrHwy154Vars.vcf
Retained 45699 variable loci in morefilter_filtered2x_tcrHwy154Vars.vcf.
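filterSomeMore.pl is likewise not shown; the two checks above amount to a depth ceiling (to avoid repeats) and a minimum spacing between retained SNPs. Roughly (a sketch, not the actual script; how pairs of close SNPs are resolved is a guess):
#!/usr/bin/perl
# rough sketch: drop very high coverage sites and sites too close to the previously kept SNP
use warnings;
use strict;

my $maxCoverage = 6485; ## maximum depth, mean + 3 SD
my $mind = 4;           ## minimum distance (bp) between retained SNPs
my ($lastScaf, $lastPos) = ("", -1);
while(<>){
    if(m/^#/){ print; next; }
    my ($scaf, $pos) = (split(" ", $_))[0,1];
    my ($dp) = m/DP=(\d+)/;
    next if (defined($dp) && $dp > $maxCoverage);
    next if ($scaf eq $lastScaf && ($pos - $lastPos) < $mind);
    ($lastScaf, $lastPos) = ($scaf, $pos);
    print;
}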
Convert to genotype likelihood (gl) format, retaining only common (MAF > 5%) SNPs.
perl ../scripts/vcf2gl.pl 0.05 morefilter_filtered2x_tcrHwy154Vars.vcf
The file tcrCline_commonvars.gl contains 18,589 common SNPs with 472 individuals.
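For reference, the gl format (as splitPops.pl below parses it) has a header line with the number of individuals and the number of loci, a second line of individual IDs, and then one line per locus consisting of a single locus-identifier token followed by three genotype likelihoods per individual. Roughly (the values and the locus-identifier format shown here are made up):
472 18589
tcr-16_p0453_4 tcr-16_p0114_2 ... tcr-16_p0162_10
lg8:scaffold128:12345 0 3 27 0 6 33 ...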
5. Split populations and estimate allele frequencies.
I moved the genotype likelihood file to a freqs sub-directory. I then ran splitPops.pl to split this into one gl file per population. I defined populations based on the file ID_codes_by_location_for_Zach.csv from Freddie. Any individual with a population/location ID not in the first column of this file was assigned to population 'NA' and not used for further analysis. This dropped 45 individuals. Here is the script:
#!/usr/bin/perl
#
# splits a genotype likelihood file by populations
#
# usage splitPops.pl in.gl
#
# look-up table
open(IN, "ID_codes_by_location_for_Zach.csv") or die "failed to read table\n";
<IN>;
while(<IN>){
    chomp;
    @line = split(",",$_);
    $pop{$line[0]} = $line[6];
}
close(IN);
my $in = shift (@ARGV);
open(IN, $in) or die "failed to open the infile\n";
## get no. inds, no. loci
$line = <IN>;
chomp($line);
$line =~ m/(\d+)\s+(\d+)/;
$nind = $1;
$nloc = $2;
## get ind. and pop. ids
$line = <IN>;
chomp($line);
@line = split (" ",$line);
foreach $ind (@line){
    $ind =~ m/^tcr\-(\S+)/ or die "failed to match here $ind\n";
    $p = $1;
    $p =~ s/_\d+$//; ## strip the individual number to get the location code
    print "$p $ind\n";
    if(defined $pop{$p}){ ## look up the population for this location
        $id = $pop{$p};
    }
    else{
        $id = 'NA'; ## locations not in the table go to 'NA'
    }
    push (@id,$id);
    push (@{$popids{$id}},$ind);
    $ids{$id} = 1;
    if(defined $popn{$id}){ ## count individuals per population
        $popn{$id}++;
    }
    else {
        $popn{$id} = 1;
    }
}
## open one file per population
foreach $id (sort keys %ids){
    $fh = "F"."$id";
    open ($fh, "> $id".".gl") or die "Could not write $id\n";
    $files{$id} = $fh;
    print {$files{$id}} "$popn{$id} $nloc\n"; ## header: no. inds, no. loci
    $pids = join (" ",@{$popids{$id}});
    print {$files{$id}} "$pids\n"; ## individual ids
    @ones = ();
    for($i=0;$i<$popn{$id}; $i++){
        push (@ones,1);
    }
    $ones = join (" ", @ones);
    print {$files{$id}} "$ones\n"; ## a 1 for each individual
}
## read and write
while (<IN>){
    chomp;
    @line = split (" ",$_);
    $a = shift(@line); ## locus info
    foreach $id (sort keys %ids){
        print {$files{$id}} "$a";
    }
    for ($i=0; $i<$nind; $i++){
        $id = $id[$i];
        for ($j=0; $j<3; $j++){
            $a = shift(@line);
            print {$files{$id}} " $a";
        }
    }
    foreach $id (sort keys %ids){
        print {$files{$id}} "\n";
    }
}
close (IN);
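The script would have been run from the freqs sub-directory along the lines of (the exact path is an assumption):
perl ../scripts/splitPops.pl tcrCline_commonvars.gl
This writes one gl file per population code (A.gl through P.gl, which become the p_A.txt through p_P.txt files read in step 6) plus NA.gl for the dropped individuals.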
I then used my EM maximum likelihood inference program (estpEM) to obtain allele frequency estimates. This was done with the wrapper script runEstp.pl, which iterates over gl files:
#!/usr/bin/perl
#
# runs estpEM on each genotype likelihood (gl) file given on the command line
#
foreach $pop (@ARGV){
    $out = $pop;
    $out =~ s/gl/txt/; ## e.g., A.gl -> A.txt
    $out = "p_$out";   ## then A.txt -> p_A.txt
    print "estpEM -i $pop -o $out -e 0.001 -m 20 -h 2\n";
    system "estpEM -i $pop -o $out -e 0.001 -m 20 -h 2\n";
}
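Presumably this was invoked over all of the per-population gl files at once, e.g. (path assumed):
perl ../scripts/runEstp.pl *.gl
(If NA.gl gets included it just produces a p_NA.txt that is never read downstream.)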
This generates p_* files, one per population with the allele frequency estimates (column 3).
6. Combine the frequency estimates and sub-sample by LG, scaffold, etc.
Here is the R code (formatFile.R) for generating the final data files for Freddie. Sample sizes and locus IDs were first extracted (e.g., with cut and grep) from the gl files; a sketch of that extraction is given after the R code.
## genetic data
L<-18589
p<-matrix(NA,nrow=L,ncol=16)
files<-c("p_A.txt","p_B.txt","p_C.txt","p_D.txt","p_E.txt","p_F.txt","p_G.txt","p_H.txt","p_I.txt","p_J.txt","p_K.txt","p_L.txt","p_M.txt","p_N.txt","p_O.txt","p_P.txt")
for(i in 1:16){
    x<-read.table(files[i],header=FALSE)
    p[,i]<-as.numeric(x[,3]) ## column 3 = allele frequency estimate
}
snps<-read.table("snpids.txt",header=F)
## outfiles
## all snps
pops<-LETTERS[1:16]
N<-scan("samSize.txt")
oN<-cbind(pops,N)
write.table(oN,"SampleSize.csv",sep=",",row.names=F,col.names=T,quote=F)
out1<-cbind(snps,p)
colnames(out1)<-c("lg","scaf","pos",pops)
write.table(out1,"tcrClineHwy154Gbs.csv",sep=",",row.names=F,col.names=T)
## subset, lg8
x<-which(out1$lg==8)
write.table(out1[x,],"tcrClineHwy154GbsLg8.csv",sep=",",row.names=F,col.names=T)
## scaffold 128
x<-which(out1$scaf==128)
write.table(out1[x,],"tcrClineHwy154GbsSc128.csv",sep=",",row.names=F,col.names=T)
## 100 snps from 128
xx<-sample(x,100,replace=FALSE)
write.table(out1[xx,],"tcrClineHwy154GbsSc128sub.csv",sep=",",row.names=F,col.names=T)
## 100 random snps (not on lg8)
x<-which(out1[,1]!=8)
xx<-sample(x,100,replace=FALSE)
write.table(out1[xx,],"tcrClineHwy154GbsRan100.csv",sep=",",row.names=F,col.names=T)
save(list=ls(),file="tcrClineHwy154.rdat")
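As noted above, snpids.txt and samSize.txt were extracted from the gl files with command-line tools; here is a rough Perl sketch of an equivalent extraction (the locus-identifier parsing is a placeholder and would need to match the real format):
#!/usr/bin/perl
# sketch: pull locus IDs from the combined gl file and per-population sample sizes
use warnings;
use strict;

## locus IDs: first field of each data line (skip the two header lines)
open(my $gl, '<', "tcrCline_commonvars.gl") or die "failed to open the gl file\n";
<$gl>; <$gl>;
open(my $snp, '>', "snpids.txt") or die "failed to write snpids.txt\n";
while(<$gl>){
    my ($locus) = split(" ", $_);
    $locus =~ tr/:/ /; ## assumes lg:scaffold:position identifiers; adjust to the real format
    print $snp "$locus\n";
}
close($snp);
close($gl);

## sample sizes: first number on the header line of each population gl file, in A-P order
open(my $ss, '>', "samSize.txt") or die "failed to write samSize.txt\n";
foreach my $pop ('A' .. 'P'){
    open(my $pgl, '<', "$pop.gl") or die "no $pop.gl\n";
    my $head = <$pgl>;
    my ($n) = split(" ", $head);
    print $ss "$n\n";
    close($pgl);
}
close($ss);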