Sequence data, 2019 redwood experiment plus

Post date: Aug 13, 2019 2:51:37 AM

I received the GBS data from the 2019 redwood performance experiment (T. knulli and T. petita) plus some data from mostly green T. cristinae that we will use to chose a bug to sequence for a homozygous green genome. The data were generated by Tom via UT. The data, 2 lanes, are in:

/uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Sequences

and are TM_S4_L007_R1_001.fastq and TM_S4_L008_R1_001.fastq

wc -l *fastq

1210989764 TM_S4_L007_R1_001.fastq

1217979148 TM_S4_L008_R1_001.fastq

I recieved barcode files from Tom, Timema1bcodekey.csv and Timema2bcodekey.csv, which I modified to format correctly as Timema1bc.csv and Timema2bc.csv. I think these go with lanes 7 and 8, respectively. Here is what I have done to parse/process the data:

1. Split files into sets of 10 million sequences (40 million lines)

split -l 40000000 TM_S4_L007_R1_001.fastq xTM_S4_L007

split -l 40000000 TM_S4_L008_R1_001.fastq xTM_S4_L008

2. Filter for PhiX

sbatch SubClean.sh

rm x*

This runs PhiXFilterFork.pl

#!/usr/bin/perl

# filter phix by aligning to the phix reference, uses fork

system "module load bwa\n";

system "module load samtools\n";

use Parallel::ForkManager;

my $max = 20;

my $pm = Parallel::ForkManager->new($max);

my $phix = "/uufs/chpc.utah.edu/common/home/u6000989/data/phixSeqIndex/NC_001422.fasta";

FILES:

foreach $fq (@ARGV){

$pm->start and next FILES; ## fork

$fq =~ m/^([a-zA-Z_\-0-9]+)/ or die "failed to match $fq\n";

$base = $1;

print "Running alignment for $fq to phix\n";

system "bwa aln -n 5 -l 20 -k 2 -t 4 $phix $fq -f $base.sai\n";

system "bwa samse -f $base.sam -r \'\@RG\\tID:PHIX\' $phix $base.sai $fq\n";

system "samtools view -S -b $base.sam > $base.bam\n";

print "Removing reads mapped to phix for $base\n";

system "samtools view -f4 $base.bam > $base.sam\n";

system "samtools view -S -b $base.sam > $base.bam\n";

print "Creating filtered fastq for $base\n";

system "samtools bam2fq $base.bam > clean_$base.fastq\n";

$pm->finish;

}

$pm->wait_all_children;

3. Parse the barcode sequences

sbatch SubParse.sh

which runs,

#!/bin/sh

#SBATCH --time=96:00:00

#SBATCH --nodes=1

#SBATCH --ntasks=24

#SBATCH --account=gompert-kp

#SBATCH --partition=gompert-kp

#SBATCH --job-name=parseseq

#SBATCH --mail-type=FAIL

#SBATCH --mail-user=zach.gompert@usu.edu

module load perl

cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Sequences

perl RunParseFork.pl clean*fastq

The latter runs the standard barcode parsing script.

4. Indexed the new (Hi-C, stripe) T. cristinae genome.

bwa index -a bwtsw timema_cristinae_12Jun2019_lu3Hs.fasta

5. Align the data from the 311 individuals to the stripe genome with bwa mem using bwa version 0.7.17-r1188.

From /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Alignments

perl wrap_qsub_slurm_bwa.pl 19_0*

This runs all 311 alignments. The command for one looks like this:

cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Alignments

bwa mem -t 8 -k 15 -r 1.3 -T 30 -r '@RG\tID:Timema-19_0335\tPL:ILLUMINA\tLB:Timema-19_0335\tSM:Timema-19_0335' /uufs/chpc.utah.edu/common/home/u6000989/data/timema/hic_genomes/t_crist/timema_cristinae_12Jun2019_lu3Hs.fasta 19_0335.fastq > aln19_0335.sam

6. Rename, compress, sort and index the alignments. Note that this is the standard step with samtools (version 1.5 with htslib 1.5), but here I am prepending the species names from idTab.txt to the bam files.

perl wrap_qsub_slurm_sam2bam.pl aln19_0*sam

Runs the 311 alignment files, one of which looks like this:

cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Alignments

samtools view -b -o Tcristinae_19_0335.bam aln19_0335.sam

samtools sort -o Tcristinae_19_0335.sorted.bam Tcristinae_19_0335.bam

samtools index -b Tcristinae_19_0335.sorted.bam

7. Variant calling with samtools (1.5) and bcftools (1.6). From /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/

sbatch SamVarCallTcr.sh

sbatch SamVarCallTknul.sh

sbatch SamVarCallTpet.sh

This differ in the set of bams. Here is the T. knulli example (other than bams, all options are the same for the others):

#!/bin/sh

#SBATCH --time=240:00:00

#SBATCH --nodes=1

#SBATCH --ntasks=24

#SBATCH --account=gompert-kp

#SBATCH --partition=gompert-kp

#SBATCH --job-name=sam_varcall

#SBATCH --mail-type=FAIL

#SBATCH --mail-user=zach.gompert@usu.edu

#SAMTOOLs mpileup version 1.5 options used:

#C = adjust mapping quality; recommended:50, disable:0 [0]

#d = max per-file depth; avoids excessive memory usage [250]

#f = faidx indexed reference sequence file

#q = skip alignments with mapQ smaller than INT [0]

#Q = skip bases with baseQ/BAQ smaller than INT [13]

#g = generate genotype likelihoods in BCF format

#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []

#BCFTOOLs call version 1.6 options used

#v = output variant sites only

#c/m = he original calling method (conflicts with -m) or alternative model for multiallelic and rare-variant calling (conflicts with -c)

#p = variant if P(ref|D)<FLOAT with -c [0.5]

#P = --prior <float-o mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]

#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' (here it is 'v')

#o = write output to a file [standard output]

module load samtools

module load bcftools

cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Alignments/

samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/timema/hic_genomes/t_crist/timema_cristinae_12Jun2019_lu3Hs.fasta -q 20 -Q 20 -g -I -t DP,AD -u -b tknul_bams -o tknul_gbs_2019.bcf

bcftools call -v -c -p 0.01 -P 0.001 -O v -o tknul_gbs_2019.vcf tknul_gbs_2019.bcf