Post date: Sep 11, 2019 10:18:4 PM
We have sequence data from Timema chumash used in a mark-release-recpature experiment testing for selection on color-associated SNPs. Patrik carried out the experiment and Tom generated the data (via UT). The data (four lanes) are in /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_chumash_gbs.
Here are the files (numbers = # lines = 4x # sequences):
wc -l TM*fastq
1112778600 TM2_S2_L005_R1_001.fastq
1124734296 TM2_S2_L006_R1_001.fastq
1033291312 TM2_S2_L007_R1_001.fastq
1105705924 TM2_S2_L008_R1_001.fastq
4376510132 total
I am processing them in the same way that I processed the recent T. knulli and T. petita data (Sequence data, 2019 redwood experiment plus).
1. Filter for PhiX.
sbatch SubClean
which runs PhiXFilterFork.pl
#!/usr/bin/perl
#
# filter phix by aligning to the phix reference, uses fork
#
system "module load bwa\n";
system "module load samtools\n";
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
my $phix = "/uufs/chpc.utah.edu/common/home/u6000989/data/phixSeqIndex/NC_001422.fasta";
FILES:
foreach $fq (@ARGV){
$pm->start and next FILES; ## fork
$fq =~ m/^([a-zA-Z_\-0-9]+)/ or die "failed to match $fq\n";
$base = $1;
print "Running alignment for $fq to phix\n";
system "bwa aln -n 5 -l 20 -k 2 -t 4 $phix $fq -f $base.sai\n";
system "bwa samse -f $base.sam -r \'\@RG\\tID:PHIX\' $phix $base.sai $fq\n";
system "samtools view -S -b $base.sam > $base.bam\n";
print "Removing reads mapped to phix for $base\n";
system "samtools view -f4 $base.bam > $base.sam\n";
system "samtools view -S -b $base.sam > $base.bam\n";
print "Creating filtered fastq for $base\n";
system "samtools bam2fq $base.bam > clean_$base.fastq\n";
$pm->finish;
}
$pm->wait_all_children;
2. Split the clean files into smaller chunks for parsing.
split -l 40000000 clean_TM2_S2_L005_R1_001.fastq x_TM2_S2_L005 &
split -l 40000000 clean_TM2_S2_L006_R1_001.fastq x_TM2_S2_L006 &
split -l 40000000 clean_TM2_S2_L007_R1_001.fastq x_TM2_S2_L007 &
split -l 40000000 clean_TM2_S2_L008_R1_001.fastq x_TM2_S2_L008 &
3. Run parse barcodes (barcode plate 2 for all).
sbatch SubParse.sh
which runs RunParseFork.pl
#!/usr/bin/perl
#
# manage barcode parsing
#
use Parallel::ForkManager;
my $max = 24;
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach $fq (@ARGV){
$pm->start and next FILES; ## fork
$fq =~ m/x_TM2_S2_L00([0-9])/ or die "failed to match $fq\n";
$lane = $1;
$bc = "Timema2bc.csv";
print "Parsing $fq -- $lane with $bc\n";
system "perl ./parse_barcodes768.pl $bc $fq\n";
$pm->finish;
}
4. Alignment to the green-striped HiC genome with bwa mem (version 0.7.17-r1188). This is from the Parsed subdirectory.
Run,
perl wrap_qsub_slurm_bwa.pl 19_*fastq
Here is an example command.
cd /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_chumash_gbs/Parsed
bwa mem -t 8 -k 15 -r 1.3 -T 30 -r '@RG\tID:Timema-19_0884\tPL:ILLUMINA\tLB:Timema-19_0884\tSM:Timema-19_0884' /uufs/chpc.utah.edu/common/home/u6000989/data/timema/hic_genomes/t_crist/timema_cristinae_12Jun2019_lu3Hs.fasta 19_0884.fastq > aln19_0884.sam
5. Use samtools to compress, sort and index the alignments. This was run within the Alignments subdirectory.
Run,
perl wrap_qsub_slurm_sam2bam.pl *sam
Here is an example command:
cd /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_chumash_gbs/Alignments
samtools view -b -o aln19_0884.bam aln19_0884.sam
samtools sort -o aln19_0884.sorted.bam aln19_0884.bam
samtools index -b aln19_0884.sorted.bam
I then deleted the sam files and unsorted bam files, leaving only the sorted bam files and their index files.
I summarized the alignments with getMappedPct.pl which generated mappedRdsPct.txt. For most samples, 65-69% of reads mapped to the genome, however some had very few reads map, and these tended to be samples with few reads total.
6. Variant calling with samtools 1.5 and bcftools 1.6. Working in the Alignments subdirectory. Run SamVarCallTchum.sh. This calls variants for samples in bamfiles. Note I dropped one individual where the same ID had been assigned to two barcodes.
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=sam_varcall
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
#SAMTOOLs mpileup version 1.5 options used:
#C = adjust mapping quality; recommended:50, disable:0 [0]
#d = max per-file depth; avoids excessive memory usage [250]
#f = faidx indexed reference sequence file
#q = skip alignments with mapQ smaller than INT [0]
#Q = skip bases with baseQ/BAQ smaller than INT [13]
#g = generate genotype likelihoods in BCF format
#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
#BCFTOOLs call version 1.6 options used
#v = output variant sites only
#c/m = he original calling method (conflicts with -m) or alternative model for multiallelic and rare-variant calling (conflicts with -c)
#p = variant if P(ref|D)<FLOAT with -c [0.5]
#P = --prior <float-o mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' (here it is 'v')
#o = write output to a file [standard output]
module load samtools
module load bcftools
cd /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_chumash_gbs/Alignments
samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/timema/hic_genomes/t_crist/timema_cristinae_12Jun2019_lu3Hs.fasta -q 20 -Q 20 -g -I -t DP,AD -u -b bamfiles -o tchum_gbs_2019.bcf
bcftools call -v -c -p 0.01 -P 0.001 -O v -o tchum_gbs_2019.vcf tchum_gbs_2019.bcf
The resulting variant files are in the Variants subdirectory.