Alignment and variant calling for the 2019 T. chumash experiment GBS data

Post date: Sep 11, 2019 10:18:4 PM

We have sequence data from Timema chumash used in a mark-release-recpature experiment testing for selection on color-associated SNPs. Patrik carried out the experiment and Tom generated the data (via UT). The data (four lanes) are in /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_chumash_gbs.

Here are the files (numbers = # lines = 4x # sequences):

wc -l TM*fastq

1112778600 TM2_S2_L005_R1_001.fastq

1124734296 TM2_S2_L006_R1_001.fastq

1033291312 TM2_S2_L007_R1_001.fastq

1105705924 TM2_S2_L008_R1_001.fastq

4376510132 total

I am processing them in the same way that I processed the recent T. knulli and T. petita data (Sequence data, 2019 redwood experiment plus).

1. Filter for PhiX.

sbatch SubClean

which runs PhiXFilterFork.pl

#!/usr/bin/perl

# filter phix by aligning to the phix reference, uses fork

system "module load bwa\n";

system "module load samtools\n";

use Parallel::ForkManager;

my $max = 20;

my $pm = Parallel::ForkManager->new($max);

my $phix = "/uufs/chpc.utah.edu/common/home/u6000989/data/phixSeqIndex/NC_001422.fasta";

FILES:

foreach $fq (@ARGV){

$pm->start and next FILES; ## fork

$fq =~ m/^([a-zA-Z_\-0-9]+)/ or die "failed to match $fq\n";

$base = $1;

print "Running alignment for $fq to phix\n";

system "bwa aln -n 5 -l 20 -k 2 -t 4 $phix $fq -f $base.sai\n";

system "bwa samse -f $base.sam -r \'\@RG\\tID:PHIX\' $phix $base.sai $fq\n";

system "samtools view -S -b $base.sam > $base.bam\n";

print "Removing reads mapped to phix for $base\n";

system "samtools view -f4 $base.bam > $base.sam\n";

system "samtools view -S -b $base.sam > $base.bam\n";

print "Creating filtered fastq for $base\n";

system "samtools bam2fq $base.bam > clean_$base.fastq\n";

$pm->finish;

}

$pm->wait_all_children;

2. Split the clean files into smaller chunks for parsing.

split -l 40000000 clean_TM2_S2_L005_R1_001.fastq x_TM2_S2_L005 &

split -l 40000000 clean_TM2_S2_L006_R1_001.fastq x_TM2_S2_L006 &

split -l 40000000 clean_TM2_S2_L007_R1_001.fastq x_TM2_S2_L007 &

split -l 40000000 clean_TM2_S2_L008_R1_001.fastq x_TM2_S2_L008 &

3. Run parse barcodes (barcode plate 2 for all).

sbatch SubParse.sh

which runs RunParseFork.pl

#!/usr/bin/perl

# manage barcode parsing

use Parallel::ForkManager;

my $max = 24;

my $pm = Parallel::ForkManager->new($max);

FILES:

foreach $fq (@ARGV){

$pm->start and next FILES; ## fork

$fq =~ m/x_TM2_S2_L00([0-9])/ or die "failed to match $fq\n";

$lane = $1;

$bc = "Timema2bc.csv";

print "Parsing $fq -- $lane with $bc\n";

system "perl ./parse_barcodes768.pl $bc $fq\n";

$pm->finish;

}

4. Alignment to the green-striped HiC genome with bwa mem (version 0.7.17-r1188). This is from the Parsed subdirectory.

Run,

perl wrap_qsub_slurm_bwa.pl 19_*fastq

Here is an example command.

cd /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_chumash_gbs/Parsed

bwa mem -t 8 -k 15 -r 1.3 -T 30 -r '@RG\tID:Timema-19_0884\tPL:ILLUMINA\tLB:Timema-19_0884\tSM:Timema-19_0884' /uufs/chpc.utah.edu/common/home/u6000989/data/timema/hic_genomes/t_crist/timema_cristinae_12Jun2019_lu3Hs.fasta 19_0884.fastq > aln19_0884.sam

5. Use samtools to compress, sort and index the alignments. This was run within the Alignments subdirectory.

Run,

perl wrap_qsub_slurm_sam2bam.pl *sam

Here is an example command:

cd /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_chumash_gbs/Alignments

samtools view -b -o aln19_0884.bam aln19_0884.sam

samtools sort -o aln19_0884.sorted.bam aln19_0884.bam

samtools index -b aln19_0884.sorted.bam

I then deleted the sam files and unsorted bam files, leaving only the sorted bam files and their index files.

I summarized the alignments with getMappedPct.pl which generated mappedRdsPct.txt. For most samples, 65-69% of reads mapped to the genome, however some had very few reads map, and these tended to be samples with few reads total.

6. Variant calling with samtools 1.5 and bcftools 1.6. Working in the Alignments subdirectory. Run SamVarCallTchum.sh. This calls variants for samples in bamfiles. Note I dropped one individual where the same ID had been assigned to two barcodes.

#!/bin/sh

#SBATCH --time=240:00:00

#SBATCH --nodes=1

#SBATCH --ntasks=24

#SBATCH --account=gompert-kp

#SBATCH --partition=gompert-kp

#SBATCH --job-name=sam_varcall

#SBATCH --mail-type=FAIL

#SBATCH --mail-user=zach.gompert@usu.edu

#SAMTOOLs mpileup version 1.5 options used:

#C = adjust mapping quality; recommended:50, disable:0 [0]

#d = max per-file depth; avoids excessive memory usage [250]

#f = faidx indexed reference sequence file

#q = skip alignments with mapQ smaller than INT [0]

#Q = skip bases with baseQ/BAQ smaller than INT [13]

#g = generate genotype likelihoods in BCF format

#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []

#BCFTOOLs call version 1.6 options used

#v = output variant sites only

#c/m = he original calling method (conflicts with -m) or alternative model for multiallelic and rare-variant calling (conflicts with -c)

#p = variant if P(ref|D)<FLOAT with -c [0.5]

#P = --prior <float-o mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]

#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' (here it is 'v')

#o = write output to a file [standard output]

module load samtools

module load bcftools

cd /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_chumash_gbs/Alignments

samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/timema/hic_genomes/t_crist/timema_cristinae_12Jun2019_lu3Hs.fasta -q 20 -Q 20 -g -I -t DP,AD -u -b bamfiles -o tchum_gbs_2019.bcf

bcftools call -v -c -p 0.01 -P 0.001 -O v -o tchum_gbs_2019.vcf tchum_gbs_2019.bcf

The resulting variant files are in the Variants subdirectory.