Post date: Aug 13, 2019 2:51:37 AM
I received the GBS data from the 2019 redwood performance experiment (T. knulli and T. petita) plus some data from mostly green T. cristinae that we will use to chose a bug to sequence for a homozygous green genome. The data were generated by Tom via UT. The data, 2 lanes, are in:
/uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Sequences
and are TM_S4_L007_R1_001.fastq and TM_S4_L008_R1_001.fastq
wc -l *fastq
1210989764 TM_S4_L007_R1_001.fastq
1217979148 TM_S4_L008_R1_001.fastq
I recieved barcode files from Tom, Timema1bcodekey.csv and Timema2bcodekey.csv, which I modified to format correctly as Timema1bc.csv and Timema2bc.csv. I think these go with lanes 7 and 8, respectively. Here is what I have done to parse/process the data:
1. Split files into sets of 10 million sequences (40 million lines)
split -l 40000000 TM_S4_L007_R1_001.fastq xTM_S4_L007
split -l 40000000 TM_S4_L008_R1_001.fastq xTM_S4_L008
2. Filter for PhiX
sbatch SubClean.sh
rm x*
This runs PhiXFilterFork.pl
#!/usr/bin/perl
#
# filter phix by aligning to the phix reference, uses fork
#
system "module load bwa\n";
system "module load samtools\n";
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
my $phix = "/uufs/chpc.utah.edu/common/home/u6000989/data/phixSeqIndex/NC_001422.fasta";
FILES:
foreach $fq (@ARGV){
$pm->start and next FILES; ## fork
$fq =~ m/^([a-zA-Z_\-0-9]+)/ or die "failed to match $fq\n";
$base = $1;
print "Running alignment for $fq to phix\n";
system "bwa aln -n 5 -l 20 -k 2 -t 4 $phix $fq -f $base.sai\n";
system "bwa samse -f $base.sam -r \'\@RG\\tID:PHIX\' $phix $base.sai $fq\n";
system "samtools view -S -b $base.sam > $base.bam\n";
print "Removing reads mapped to phix for $base\n";
system "samtools view -f4 $base.bam > $base.sam\n";
system "samtools view -S -b $base.sam > $base.bam\n";
print "Creating filtered fastq for $base\n";
system "samtools bam2fq $base.bam > clean_$base.fastq\n";
$pm->finish;
}
$pm->wait_all_children;
3. Parse the barcode sequences
sbatch SubParse.sh
which runs,
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=parseseq
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
module load perl
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Sequences
perl RunParseFork.pl clean*fastq
The latter runs the standard barcode parsing script.
4. Indexed the new (Hi-C, stripe) T. cristinae genome.
bwa index -a bwtsw timema_cristinae_12Jun2019_lu3Hs.fasta
5. Align the data from the 311 individuals to the stripe genome with bwa mem using bwa version 0.7.17-r1188.
From /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Alignments
perl wrap_qsub_slurm_bwa.pl 19_0*
This runs all 311 alignments. The command for one looks like this:
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Alignments
bwa mem -t 8 -k 15 -r 1.3 -T 30 -r '@RG\tID:Timema-19_0335\tPL:ILLUMINA\tLB:Timema-19_0335\tSM:Timema-19_0335' /uufs/chpc.utah.edu/common/home/u6000989/data/timema/hic_genomes/t_crist/timema_cristinae_12Jun2019_lu3Hs.fasta 19_0335.fastq > aln19_0335.sam
6. Rename, compress, sort and index the alignments. Note that this is the standard step with samtools (version 1.5 with htslib 1.5), but here I am prepending the species names from idTab.txt to the bam files.
perl wrap_qsub_slurm_sam2bam.pl aln19_0*sam
Runs the 311 alignment files, one of which looks like this:
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Alignments
samtools view -b -o Tcristinae_19_0335.bam aln19_0335.sam
samtools sort -o Tcristinae_19_0335.sorted.bam Tcristinae_19_0335.bam
samtools index -b Tcristinae_19_0335.sorted.bam
7. Variant calling with samtools (1.5) and bcftools (1.6). From /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/
sbatch SamVarCallTcr.sh
sbatch SamVarCallTknul.sh
sbatch SamVarCallTpet.sh
This differ in the set of bams. Here is the T. knulli example (other than bams, all options are the same for the others):
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=sam_varcall
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
#SAMTOOLs mpileup version 1.5 options used:
#C = adjust mapping quality; recommended:50, disable:0 [0]
#d = max per-file depth; avoids excessive memory usage [250]
#f = faidx indexed reference sequence file
#q = skip alignments with mapQ smaller than INT [0]
#Q = skip bases with baseQ/BAQ smaller than INT [13]
#g = generate genotype likelihoods in BCF format
#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
#BCFTOOLs call version 1.6 options used
#v = output variant sites only
#c/m = he original calling method (conflicts with -m) or alternative model for multiallelic and rare-variant calling (conflicts with -c)
#p = variant if P(ref|D)<FLOAT with -c [0.5]
#P = --prior <float-o mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' (here it is 'v')
#o = write output to a file [standard output]
module load samtools
module load bcftools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_gbs_2019/Alignments/
samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/timema/hic_genomes/t_crist/timema_cristinae_12Jun2019_lu3Hs.fasta -q 20 -Q 20 -g -I -t DP,AD -u -b tknul_bams -o tknul_gbs_2019.bcf
bcftools call -v -c -p 0.01 -P 0.001 -O v -o tknul_gbs_2019.vcf tknul_gbs_2019.bcf