Post date: Oct 14, 2016 11:2:28 PM
We want to look for evidence of exceptional genomic change for scaffolds 128 and 702.1 based on whole-genome sequencing of the Ecology Letters experiment bugs. The data are in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_wgrs/plate*/. The old assemblies are in the file indIds.txt within the assembliesExperiment sub-directory. I am submitting the files one plate at a time (from within the plate sub-directory). I need at least some files from plates 0-5, but maybe not all (the script should only submit necessary jobs).
1. Here is the command for the 376 samples from plate 1.
perl ~/data/timema/combind_wgs_dovetailV3/scripts/wrap_qsub_slurm_bwa-wgs.pl WTCHG_6*1.fastq.gz
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/alignments_ecolexp/
bwa mem -t 8 -k 20 -w 100 -r 1.3 -T 30 -R '@RG\tID:TC_1A_24630-61246\tPL:ILLUMINA\tLB:TC_1A_24630\tSM:TC_1A_24630' /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/map_timema_06Jun2016_RvNkF702.fasta /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_wgrs/plate1/WTCHG_61246_296_1.fastq.gz /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_wgrs/plate1/WTCHG_61246_296_2.fastq.gz > /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/alignments_ecolexp/aln_1_61246_TC_1A_24630.sam 2> /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/alignments_ecolexp/error_1_61246_TC_1A_24630.log
The results will be in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/alignments_ecolexp/.
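The wrapper script itself is not shown here, so the following is only a rough sketch of how a command like the one above could be assembled from the plate number, run ID, index and sample ID. The function name build_bwa_cmd, its arguments, and the REF, FQDIR and OUTDIR variables are assumptions for illustration. It prints the command rather than running bwa.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the bwa mem command for one read pair.
# REF, FQDIR and OUTDIR are placeholder paths, not the real cluster paths.
REF=${REF:-map_timema_06Jun2016_RvNkF702.fasta}
FQDIR=${FQDIR:-timema_wgrs}
OUTDIR=${OUTDIR:-alignments_ecolexp}

build_bwa_cmd() {
    plate=$1; run=$2; index=$3; sample=$4
    # read group with literal \t separators, as passed to bwa on the command line
    rg="@RG\\tID:${sample}-${run}\\tPL:ILLUMINA\\tLB:${sample}\\tSM:${sample}"
    fq="${FQDIR}/plate${plate}/WTCHG_${run}_${index}"
    stem="${plate}_${run}_${sample}"
    printf '%s\n' "bwa mem -t 8 -k 20 -w 100 -r 1.3 -T 30 -R '${rg}' ${REF} ${fq}_1.fastq.gz ${fq}_2.fastq.gz > ${OUTDIR}/aln_${stem}.sam 2> ${OUTDIR}/error_${stem}.log"
}

# Example: the read pair used in the command above
build_bwa_cmd 1 61246 296 TC_1A_24630
```

Printing the command first makes it easy to check the read group and output names before anything is submitted.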
I finished successfully aligning the data from all six plates (0-5) resulting in 2058 sam files.
2. I am now compressing, sorting and indexing these, e.g.,
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/alignments_ecolexp/
samtools view -b -S -o aln_5_68110_TC_5C_25090.bam aln_5_68110_TC_5C_25090.sam
samtools sort aln_5_68110_TC_5C_25090.bam aln_5_68110_TC_5C_25090.sorted
samtools index aln_5_68110_TC_5C_25090.sorted.bam
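To apply the same three steps to every sam file, something like the following could be used. This is a dry-run sketch: the helper name samtools_steps is made up, and it only prints the commands (using the same older, prefix-style samtools sort syntax as above) so they can be checked or piped into a job script.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the three samtools steps for one sam file.
samtools_steps() {
    sam=$1
    base=${sam%.sam}          # strip the .sam extension
    printf '%s\n' "samtools view -b -S -o ${base}.bam ${sam}"
    printf '%s\n' "samtools sort ${base}.bam ${base}.sorted"
    printf '%s\n' "samtools index ${base}.sorted.bam"
}

# Print the steps for every alignment in the current directory (if any)
for sam in aln_*.sam; do
    [ -e "$sam" ] || continue
    samtools_steps "$sam"
done
```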
3. Next, I marked potential PCR duplicates using Picard tools. Note that I submitted this as a single batch job but ran a forked perl script.
sbatch subMarkDup.sh
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=markdup
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo -n 'Job is running on node '; echo $SLURM_JOB_NODELIST
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
module load jdk/1.8.0_25
module load gcc
module load samtools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/alignments_ecolexp/
perl makrDups.pl aln*sorted.bam
Which runs,
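#!/usr/bin/perl
#
# Fork six children; each marks duplicates for one block of 343 bam files
# (6 x 343 = 2058 files in total). The parent waits for all of them.
use strict;
use warnings;

my $forks = 0;
foreach my $i (1..6){
    my $pid = fork;
    die "fork failed: $!\n" unless defined $pid;
    if($pid){
        ## parent: count the child and move on to the next block
        $forks++;
        next;
    }
    ## child: process files $lb..$ub of the argument list, then exit
    my $lb = ($i-1) * 343;
    my $ub = 342 + ($i-1) * 343;
    foreach my $j ($lb..$ub){
        my $bam = $ARGV[$j];
        last unless defined $bam; ## fewer files than expected
        print "dedupping $bam\n";
        my $out = $bam;
        $out =~ s/sorted/dedup/ or die "failed sub here: $out\n";
        my $err = $out;
        $err =~ s/bam/log.txt/ or die "failed sub here: $err\n";
        system "java -Xmx96g -jar /uufs/chpc.utah.edu/sys/installdir/picard/2.1.1/picard.jar MarkDuplicates INPUT=$bam OUTPUT=$out METRICS_FILE=$err";
    }
    exit;
}
## parent: wait for every child to finish before the job ends
for (1..$forks){
    my $pid = wait();
    print "Parent saw $pid exit\n";
}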
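The script above hard-codes 343 files per fork. As a small sketch of that index arithmetic, and of the same launch-then-wait pattern in shell, the following computes the 0-based range each worker handles. The function name chunk_bounds is made up for illustration, and the workers here only report their range instead of calling Picard.

```shell
#!/bin/sh
# Print the first and last 0-based file index handled by worker i (1-based),
# given a fixed number of files per worker.
chunk_bounds() {
    i=$1; size=$2
    lb=$(( (i - 1) * size ))
    ub=$(( lb + size - 1 ))
    printf '%s %s\n' "$lb" "$ub"
}

# Launch six background workers and wait for all of them to finish.
for i in 1 2 3 4 5 6; do
    ( set -- $(chunk_bounds "$i" 343); echo "worker $i: files $1..$2" ) &
done
wait
```

With six workers of 343 files each, the last index is 2057, which matches the 2058 sam files from step 1.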
4. First round of variant calling for the 491 samples (need to remind myself why this isn't 500). Note that HaplotypeCaller does not support multiple threads. Instead, the Broad Institute folks propose a new strategy: run HaplotypeCaller on each sample independently to produce an intermediate *.g.vcf file (this includes genotype likelihoods), and then combine samples for variant calling afterwards. That is what I am trying. Here is the wrapper perl script and an example (note that each infile is a file that lists the four bam files for a sperm sample). I made the list files and indexed the bams with perl scripts before running this main script.
First index the reference with Picard tools:
java -jar /uufs/chpc.utah.edu/sys/installdir/picard/2.1.1/picard.jar CreateSequenceDictionary R=map_timema_06Jun2016_RvNkF702.fasta O=map_timema_06Jun2016_RvNkF702.dict
Then run the main script. NOTE: the = characters in the scaffold names need to be removed first, otherwise it WILL NOT RUN!
I fixed this with: perl fixAllSams.pl aln_*dedup.bam
I am also using the mod* version of the genome, which fixes the same problem. Basically I replaced the =s with -s.
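For the reference itself, the replacement can be done with sed. This is a minimal sketch, not the script I used: the function name fix_fasta_headers and the example header are made up, and it only covers the FASTA (the bam fix done by fixAllSams.pl is not shown).

```shell
#!/bin/sh
# Replace = with - on FASTA header lines only; sequence lines are untouched.
fix_fasta_headers() {
    sed '/^>/ s/=/-/g' "$1"
}

# Small demonstration on a made-up two-line FASTA
demo=$(mktemp)
printf '>scaffold=example=1\nACGT\n' > "$demo"
fix_fasta_headers "$demo"
rm -f "$demo"
```

For the real genome this would be run as fix_fasta_headers map_timema_06Jun2016_RvNkF702.fasta > mod_map_timema_06Jun2016_RvNkF702.fasta, after which the sequence dictionary needs to be rebuilt for the mod* file.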
perl ../Scripts/wrap_qsub_slurm_gatk.pl files_*
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/alignments_ecolexp
java -Xmx48g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/mod_map_timema_06Jun2016_RvNkF702.fasta -I /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/alignments_ecolexp/files_TC_5C_25093.list -o /scratch/general/lustre/tcrGatk/TC_5C_25093.g.vcf -gt_mode DISCOVERY -hets 0.001 -mbq 30 -out_mode EMIT_VARIANTS_ONLY -ploidy 2 -stand_call_conf 50 -pcrModel AGGRESSIVE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000
cp /scratch/general/lustre/tcrGatk/TC_5C_25093.g.vcf /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/variants_ecolexp/TC_5C_25093.g.vcf
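The per-sample list files (files_*.list) mentioned in this step were made with a perl script that is not shown. A minimal shell sketch of the same idea follows, assuming the dedup bams keep the aln_<plate>_<run>_<sample>.dedup.bam naming; the function name make_lists is made up.

```shell
#!/bin/sh
# Hypothetical helper: write one files_<sample>.list per sample, listing every
# dedup bam in a directory whose name ends in that sample ID (e.g. TC_5C_25093).
make_lists() {
    dir=$1
    rm -f "$dir"/files_*.list                # start fresh so reruns do not append duplicates
    for bam in "$dir"/aln_*.dedup.bam; do
        [ -e "$bam" ] || continue
        name=$(basename "$bam" .dedup.bam)   # e.g. aln_5_68110_TC_5C_25093
        sample=${name#aln_*_*_}              # e.g. TC_5C_25093
        printf '%s\n' "$bam" >> "$dir/files_${sample}.list"
    done
}
```

Calling make_lists on the alignment directory gives one list per sample, which is the form passed to HaplotypeCaller with -I above.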
5. Joint variant calling based on the g.vcf files in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/variants_ecolexp/.
sbatch jointVarCall.sh
#!/bin/sh
#SBATCH --time=200:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/variants_ecolexp
java -Xmx460g -jar ~/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/mod_map_timema_06Jun2016_RvNkF702.fasta --variant TC_1A_24594.g.vcf --variant TC_1A_24595.g.vcf --variant TC_1A_24596.g.vcf --variant TC_1A_24597.g.vcf --variant TC_1A_24598.g.vcf --variant TC_1A_24599.g.vcf --variant TC_1A_24600.g.vcf --variant TC_1A_24602.g.vcf --variant TC_1A_24603.g.vcf --variant TC_1A_24604.g.vcf --variant TC_1A_24605.g.vcf --variant TC_1A_24606.g.vcf --variant TC_1A_24607.g.vcf --variant TC_1A_24608.g.vcf --variant TC_1A_24609.g.vcf --variant TC_1A_24610.g.vcf --variant TC_1A_24611.g.vcf --variant TC_1A_24612.g.vcf --variant TC_1A_24613.g.vcf --variant TC_1A_24614.g.vcf --variant TC_1A_24615.g.vcf --variant TC_1A_24616.g.vcf --variant TC_1A_24617.g.vcf --variant TC_1A_24618.g.vcf --variant TC_1A_24619.g.vcf --variant TC_1A_24620.g.vcf --variant TC_1A_24621.g.vcf --variant TC_1A_24622.g.vcf --variant TC_1A_24623.g.vcf --variant TC_1A_24624.g.vcf --variant TC_1A_24625.g.vcf --variant TC_1A_24626.g.vcf --variant TC_1A_24627.g.vcf --variant TC_1A_24628.g.vcf --variant TC_1A_24629.g.vcf --variant TC_1A_24630.g.vcf --variant TC_1A_24631.g.vcf --variant TC_1A_24632.g.vcf --variant TC_1A_24633.g.vcf --variant TC_1A_24634.g.vcf --variant TC_1A_24635.g.vcf --variant TC_1A_24636.g.vcf --variant TC_1A_24637.g.vcf --variant TC_1A_24638.g.vcf --variant TC_1A_24639.g.vcf --variant TC_1A_24640.g.vcf --variant TC_1A_24641.g.vcf --variant TC_1A_24642.g.vcf --variant TC_1A_24643.g.vcf --variant TC_1C_24644.g.vcf --variant TC_1C_24645.g.vcf --variant TC_1C_24646.g.vcf --variant TC_1C_24647.g.vcf --variant TC_1C_24648.g.vcf --variant TC_1C_24649.g.vcf --variant TC_1C_24650.g.vcf --variant TC_1C_24651.g.vcf --variant TC_1C_24652.g.vcf --variant TC_1C_24653.g.vcf --variant TC_1C_24654.g.vcf --variant TC_1C_24655.g.vcf --variant TC_1C_24656.g.vcf --variant TC_1C_24657.g.vcf --variant TC_1C_24658.g.vcf --variant TC_1C_24660.g.vcf 
--variant TC_1C_24661.g.vcf --variant TC_1C_24662.g.vcf --variant TC_1C_24663.g.vcf --variant TC_1C_24664.g.vcf --variant TC_1C_24665.g.vcf --variant TC_1C_24666.g.vcf --variant TC_1C_24667.g.vcf --variant TC_1C_24668.g.vcf --variant TC_1C_24669.g.vcf --variant TC_1C_24670.g.vcf --variant TC_1C_24671.g.vcf --variant TC_1C_24672.g.vcf --variant TC_1C_24673.g.vcf --variant TC_1C_24674.g.vcf --variant TC_1C_24675.g.vcf --variant TC_1C_24676.g.vcf --variant TC_1C_24677.g.vcf --variant TC_1C_24678.g.vcf --variant TC_1C_24679.g.vcf --variant TC_1C_24680.g.vcf --variant TC_1C_24681.g.vcf --variant TC_1C_24682.g.vcf --variant TC_1C_24683.g.vcf --variant TC_1C_24684.g.vcf --variant TC_1C_24685.g.vcf --variant TC_1C_24687.g.vcf --variant TC_1C_24688.g.vcf --variant TC_1C_24689.g.vcf --variant TC_1C_24690.g.vcf --variant TC_1C_24691.g.vcf --variant TC_1C_24692.g.vcf --variant TC_1C_24693.g.vcf --variant TC_2A_24694.g.vcf --variant TC_2A_24695.g.vcf --variant TC_2A_24696.g.vcf --variant TC_2A_24697.g.vcf --variant TC_2A_24698.g.vcf --variant TC_2A_24699.g.vcf --variant TC_2A_24700.g.vcf --variant TC_2A_24701.g.vcf --variant TC_2A_24702.g.vcf --variant TC_2A_24703.g.vcf --variant TC_2A_24704.g.vcf --variant TC_2A_24705.g.vcf --variant TC_2A_24706.g.vcf --variant TC_2A_24707.g.vcf --variant TC_2A_24708.g.vcf --variant TC_2A_24709.g.vcf --variant TC_2A_24710.g.vcf --variant TC_2A_24711.g.vcf --variant TC_2A_24712.g.vcf --variant TC_2A_24713.g.vcf --variant TC_2A_24714.g.vcf --variant TC_2A_24715.g.vcf --variant TC_2A_24716.g.vcf --variant TC_2A_24717.g.vcf --variant TC_2A_24718.g.vcf --variant TC_2A_24719.g.vcf --variant TC_2A_24720.g.vcf --variant TC_2A_24721.g.vcf --variant TC_2A_24722.g.vcf --variant TC_2A_24723.g.vcf --variant TC_2A_24724.g.vcf --variant TC_2A_24726.g.vcf --variant TC_2A_24727.g.vcf --variant TC_2A_24728.g.vcf --variant TC_2A_24729.g.vcf --variant TC_2A_24730.g.vcf --variant TC_2A_24731.g.vcf --variant TC_2A_24732.g.vcf --variant TC_2A_24733.g.vcf --variant 
TC_2A_24734.g.vcf --variant TC_2A_24735.g.vcf --variant TC_2A_24736.g.vcf --variant TC_2A_24737.g.vcf --variant TC_2A_24738.g.vcf --variant TC_2A_24739.g.vcf --variant TC_2A_24740.g.vcf --variant TC_2A_24741.g.vcf --variant TC_2A_24742.g.vcf --variant TC_2A_24743.g.vcf --variant TC_2C_24744.g.vcf --variant TC_2C_24745.g.vcf --variant TC_2C_24746.g.vcf --variant TC_2C_24747.g.vcf --variant TC_2C_24748.g.vcf --variant TC_2C_24749.g.vcf --variant TC_2C_24750.g.vcf --variant TC_2C_24751.g.vcf --variant TC_2C_24752.g.vcf --variant TC_2C_24753.g.vcf --variant TC_2C_24754.g.vcf --variant TC_2C_24755.g.vcf --variant TC_2C_24756.g.vcf --variant TC_2C_24757.g.vcf --variant TC_2C_24758.g.vcf --variant TC_2C_24759.g.vcf --variant TC_2C_24760.g.vcf --variant TC_2C_24761.g.vcf --variant TC_2C_24762.g.vcf --variant TC_2C_24763.g.vcf --variant TC_2C_24764.g.vcf --variant TC_2C_24765.g.vcf --variant TC_2C_24766.g.vcf --variant TC_2C_24767.g.vcf --variant TC_2C_24768.g.vcf --variant TC_2C_24769.g.vcf --variant TC_2C_24770.g.vcf --variant TC_2C_24771.g.vcf --variant TC_2C_24772.g.vcf --variant TC_2C_24773.g.vcf --variant TC_2C_24774.g.vcf --variant TC_2C_24775.g.vcf --variant TC_2C_24776.g.vcf --variant TC_2C_24777.g.vcf --variant TC_2C_24778.g.vcf --variant TC_2C_24779.g.vcf --variant TC_2C_24780.g.vcf --variant TC_2C_24781.g.vcf --variant TC_2C_24782.g.vcf --variant TC_2C_24783.g.vcf --variant TC_2C_24784.g.vcf --variant TC_2C_24785.g.vcf --variant TC_2C_24786.g.vcf --variant TC_2C_24787.g.vcf --variant TC_2C_24788.g.vcf --variant TC_2C_24789.g.vcf --variant TC_2C_24790.g.vcf --variant TC_2C_24791.g.vcf --variant TC_2C_24792.g.vcf --variant TC_2C_24793.g.vcf --variant TC_3A_24795.g.vcf --variant TC_3A_24796.g.vcf --variant TC_3A_24797.g.vcf --variant TC_3A_24798.g.vcf --variant TC_3A_24799.g.vcf --variant TC_3A_24800.g.vcf --variant TC_3A_24801.g.vcf --variant TC_3A_24802.g.vcf --variant TC_3A_24803.g.vcf --variant TC_3A_24804.g.vcf --variant TC_3A_24805.g.vcf --variant 
TC_3A_24806.g.vcf --variant TC_3A_24807.g.vcf --variant TC_3A_24808.g.vcf --variant TC_3A_24809.g.vcf --variant TC_3A_24810.g.vcf --variant TC_3A_24811.g.vcf --variant TC_3A_24812.g.vcf --variant TC_3A_24813.g.vcf --variant TC_3A_24814.g.vcf --variant TC_3A_24815.g.vcf --variant TC_3A_24816.g.vcf --variant TC_3A_24817.g.vcf --variant TC_3A_24818.g.vcf --variant TC_3A_24819.g.vcf --variant TC_3A_24820.g.vcf --variant TC_3A_24821.g.vcf --variant TC_3A_24822.g.vcf --variant TC_3A_24823.g.vcf --variant TC_3A_24824.g.vcf --variant TC_3A_24825.g.vcf --variant TC_3A_24826.g.vcf --variant TC_3A_24827.g.vcf --variant TC_3A_24828.g.vcf --variant TC_3A_24829.g.vcf --variant TC_3A_24830.g.vcf --variant TC_3A_24831.g.vcf --variant TC_3A_24832.g.vcf --variant TC_3A_24833.g.vcf --variant TC_3A_24834.g.vcf --variant TC_3A_24835.g.vcf --variant TC_3A_24836.g.vcf --variant TC_3A_24837.g.vcf --variant TC_3A_24838.g.vcf --variant TC_3A_24839.g.vcf --variant TC_3A_24840.g.vcf --variant TC_3A_24841.g.vcf --variant TC_3A_24842.g.vcf --variant TC_3A_24843.g.vcf --variant TC_3C_24844.g.vcf --variant TC_3C_24845.g.vcf --variant TC_3C_24846.g.vcf --variant TC_3C_24847.g.vcf --variant TC_3C_24848.g.vcf --variant TC_3C_24849.g.vcf --variant TC_3C_24850.g.vcf --variant TC_3C_24851.g.vcf --variant TC_3C_24852.g.vcf --variant TC_3C_24853.g.vcf --variant TC_3C_24854.g.vcf --variant TC_3C_24855.g.vcf --variant TC_3C_24856.g.vcf --variant TC_3C_24857.g.vcf --variant TC_3C_24858.g.vcf --variant TC_3C_24859.g.vcf --variant TC_3C_24860.g.vcf --variant TC_3C_24861.g.vcf --variant TC_3C_24862.g.vcf --variant TC_3C_24863.g.vcf --variant TC_3C_24864.g.vcf --variant TC_3C_24865.g.vcf --variant TC_3C_24866.g.vcf --variant TC_3C_24867.g.vcf --variant TC_3C_24868.g.vcf --variant TC_3C_24869.g.vcf --variant TC_3C_24870.g.vcf --variant TC_3C_24871.g.vcf --variant TC_3C_24872.g.vcf --variant TC_3C_24873.g.vcf --variant TC_3C_24874.g.vcf --variant TC_3C_24875.g.vcf --variant TC_3C_24876.g.vcf --variant 
TC_3C_24877.g.vcf --variant TC_3C_24879.g.vcf --variant TC_3C_24880.g.vcf --variant TC_3C_24881.g.vcf --variant TC_3C_24882.g.vcf --variant TC_3C_24883.g.vcf --variant TC_3C_24884.g.vcf --variant TC_3C_24885.g.vcf --variant TC_3C_24886.g.vcf --variant TC_3C_24887.g.vcf --variant TC_3C_24888.g.vcf --variant TC_3C_24889.g.vcf --variant TC_3C_24890.g.vcf --variant TC_3C_24891.g.vcf --variant TC_3C_24892.g.vcf --variant TC_3C_24893.g.vcf --variant TC_4A_24894.g.vcf --variant TC_4A_24895.g.vcf --variant TC_4A_24896.g.vcf --variant TC_4A_24897.g.vcf --variant TC_4A_24898.g.vcf --variant TC_4A_24899.g.vcf --variant TC_4A_24900.g.vcf --variant TC_4A_24901.g.vcf --variant TC_4A_24902.g.vcf --variant TC_4A_24903.g.vcf --variant TC_4A_24904.g.vcf --variant TC_4A_24905.g.vcf --variant TC_4A_24906.g.vcf --variant TC_4A_24907.g.vcf --variant TC_4A_24908.g.vcf --variant TC_4A_24909.g.vcf --variant TC_4A_24910.g.vcf --variant TC_4A_24911.g.vcf --variant TC_4A_24912.g.vcf --variant TC_4A_24913.g.vcf --variant TC_4A_24914.g.vcf --variant TC_4A_24915.g.vcf --variant TC_4A_24916.g.vcf --variant TC_4A_24917.g.vcf --variant TC_4A_24918.g.vcf --variant TC_4A_24919.g.vcf --variant TC_4A_24920.g.vcf --variant TC_4A_24921.g.vcf --variant TC_4A_24922.g.vcf --variant TC_4A_24923.g.vcf --variant TC_4A_24924.g.vcf --variant TC_4A_24925.g.vcf --variant TC_4A_24926.g.vcf --variant TC_4A_24927.g.vcf --variant TC_4A_24928.g.vcf --variant TC_4A_24929.g.vcf --variant TC_4A_24930.g.vcf --variant TC_4A_24932.g.vcf --variant TC_4A_24933.g.vcf --variant TC_4A_24934.g.vcf --variant TC_4A_24935.g.vcf --variant TC_4A_24936.g.vcf --variant TC_4A_24937.g.vcf --variant TC_4A_24938.g.vcf --variant TC_4A_24939.g.vcf --variant TC_4A_24940.g.vcf --variant TC_4A_24941.g.vcf --variant TC_4A_24942.g.vcf --variant TC_4A_24943.g.vcf --variant TC_4C_24944.g.vcf --variant TC_4C_24945.g.vcf --variant TC_4C_24946.g.vcf --variant TC_4C_24948.g.vcf --variant TC_4C_24949.g.vcf --variant TC_4C_24950.g.vcf --variant 
TC_4C_24951.g.vcf --variant TC_4C_24952.g.vcf --variant TC_4C_24953.g.vcf --variant TC_4C_24954.g.vcf --variant TC_4C_24955.g.vcf --variant TC_4C_24956.g.vcf --variant TC_4C_24957.g.vcf --variant TC_4C_24958.g.vcf --variant TC_4C_24959.g.vcf --variant TC_4C_24960.g.vcf --variant TC_4C_24961.g.vcf --variant TC_4C_24962.g.vcf --variant TC_4C_24963.g.vcf --variant TC_4C_24964.g.vcf --variant TC_4C_24965.g.vcf --variant TC_4C_24966.g.vcf --variant TC_4C_24967.g.vcf --variant TC_4C_24968.g.vcf --variant TC_4C_24969.g.vcf --variant TC_4C_24970.g.vcf --variant TC_4C_24971.g.vcf --variant TC_4C_24972.g.vcf --variant TC_4C_24973.g.vcf --variant TC_4C_24974.g.vcf --variant TC_4C_24975.g.vcf --variant TC_4C_24976.g.vcf --variant TC_4C_24977.g.vcf --variant TC_4C_24978.g.vcf --variant TC_4C_24979.g.vcf --variant TC_4C_24980.g.vcf --variant TC_4C_24981.g.vcf --variant TC_4C_24982.g.vcf --variant TC_4C_24983.g.vcf --variant TC_4C_24984.g.vcf --variant TC_4C_24985.g.vcf --variant TC_4C_24986.g.vcf --variant TC_4C_24987.g.vcf --variant TC_4C_24988.g.vcf --variant TC_4C_24989.g.vcf --variant TC_4C_24990.g.vcf --variant TC_4C_24991.g.vcf --variant TC_4C_24992.g.vcf --variant TC_4C_24993.g.vcf --variant TC_5A_24994.g.vcf --variant TC_5A_24995.g.vcf --variant TC_5A_24996.g.vcf --variant TC_5A_24997.g.vcf --variant TC_5A_24998.g.vcf --variant TC_5A_24999.g.vcf --variant TC_5A_25000.g.vcf --variant TC_5A_25001.g.vcf --variant TC_5A_25002.g.vcf --variant TC_5A_25003.g.vcf --variant TC_5A_25004.g.vcf --variant TC_5A_25005.g.vcf --variant TC_5A_25006.g.vcf --variant TC_5A_25007.g.vcf --variant TC_5A_25008.g.vcf --variant TC_5A_25009.g.vcf --variant TC_5A_25010.g.vcf --variant TC_5A_25011.g.vcf --variant TC_5A_25012.g.vcf --variant TC_5A_25013.g.vcf --variant TC_5A_25014.g.vcf --variant TC_5A_25015.g.vcf --variant TC_5A_25016.g.vcf --variant TC_5A_25017.g.vcf --variant TC_5A_25018.g.vcf --variant TC_5A_25019.g.vcf --variant TC_5A_25020.g.vcf --variant TC_5A_25021.g.vcf --variant 
TC_5A_25022.g.vcf --variant TC_5A_25023.g.vcf --variant TC_5A_25024.g.vcf --variant TC_5A_25025.g.vcf --variant TC_5A_25026.g.vcf --variant TC_5A_25027.g.vcf --variant TC_5A_25028.g.vcf --variant TC_5A_25029.g.vcf --variant TC_5A_25030.g.vcf --variant TC_5A_25031.g.vcf --variant TC_5A_25032.g.vcf --variant TC_5A_25033.g.vcf --variant TC_5A_25034.g.vcf --variant TC_5A_25035.g.vcf --variant TC_5A_25036.g.vcf --variant TC_5A_25037.g.vcf --variant TC_5A_25038.g.vcf --variant TC_5A_25039.g.vcf --variant TC_5A_25040.g.vcf --variant TC_5A_25041.g.vcf --variant TC_5A_25042.g.vcf --variant TC_5A_25043.g.vcf --variant TC_5C_25044.g.vcf --variant TC_5C_25045.g.vcf --variant TC_5C_25046.g.vcf --variant TC_5C_25047.g.vcf --variant TC_5C_25048.g.vcf --variant TC_5C_25049.g.vcf --variant TC_5C_25050.g.vcf --variant TC_5C_25052.g.vcf --variant TC_5C_25053.g.vcf --variant TC_5C_25054.g.vcf --variant TC_5C_25055.g.vcf --variant TC_5C_25056.g.vcf --variant TC_5C_25057.g.vcf --variant TC_5C_25058.g.vcf --variant TC_5C_25059.g.vcf --variant TC_5C_25060.g.vcf --variant TC_5C_25061.g.vcf --variant TC_5C_25062.g.vcf --variant TC_5C_25063.g.vcf --variant TC_5C_25064.g.vcf --variant TC_5C_25065.g.vcf --variant TC_5C_25066.g.vcf --variant TC_5C_25067.g.vcf --variant TC_5C_25068.g.vcf --variant TC_5C_25069.g.vcf --variant TC_5C_25070.g.vcf --variant TC_5C_25071.g.vcf --variant TC_5C_25072.g.vcf --variant TC_5C_25073.g.vcf --variant TC_5C_25074.g.vcf --variant TC_5C_25075.g.vcf --variant TC_5C_25076.g.vcf --variant TC_5C_25077.g.vcf --variant TC_5C_25078.g.vcf --variant TC_5C_25079.g.vcf --variant TC_5C_25080.g.vcf --variant TC_5C_25081.g.vcf --variant TC_5C_25082.g.vcf --variant TC_5C_25083.g.vcf --variant TC_5C_25084.g.vcf --variant TC_5C_25085.g.vcf --variant TC_5C_25086.g.vcf --variant TC_5C_25087.g.vcf --variant TC_5C_25088.g.vcf --variant TC_5C_25089.g.vcf --variant TC_5C_25090.g.vcf --variant TC_5C_25091.g.vcf --variant TC_5C_25092.g.vcf --variant TC_5C_25093.g.vcf -ploidy 2 -o 
tcrExperimentVariants.vcf
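Typing out one --variant argument per sample is error-prone, so here is a sketch of building that part of the command from the g.vcf files present in a directory. The function name build_variant_args is made up, and the final line only prints the resulting GenotypeGVCFs command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print "--variant X.g.vcf" for every g.vcf in a directory,
# in the shell's sorted glob order, as a single space-separated string.
build_variant_args() {
    dir=$1
    args=""
    for f in "$dir"/*.g.vcf; do
        [ -e "$f" ] || continue
        args="$args --variant $(basename "$f")"
    done
    printf '%s\n' "${args# }"    # drop the leading space
}

# Dry run: show the joint-calling command for the current directory
printf '%s\n' "java -Xmx460g -jar ~/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R mod_map_timema_06Jun2016_RvNkF702.fasta $(build_variant_args .) -ploidy 2 -o tcrExperimentVariants.vcf"
```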
6. Variant filtering to remove low-quality variants. I did this in a few steps, using the perl scripts in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/variants_ecolexp/
(a) first step ran:
perl vcfFilter.pl tcrExperimentVariants.vcf
with
my $minCoverage = 491; # minimum number of sequences; DP
my $bqrs = -8; # minimum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = -12.5; # minimum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = -8; # minimum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 2; # minimum ratio of variant confidence to non-reference read depth; QD
my $mq = 40; # minimum mapping quality; MQ
my $fish = 60; #Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
I updated these based on GATK recommendations. Note that this also gets rid of indels and multi-allelic SNPs. Mapping quality cost me the most SNPs.
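vcfFilter.pl is not reproduced here, so the following awk sketch only illustrates how thresholds like these can be applied to the INFO column. The comparison directions are assumptions: DP, QD, MQ and the rank-sum values must be at or above their thresholds, FS at or below its threshold, and a missing annotation is not treated as a failure. The function name filter_vcf is made up.

```shell
#!/bin/sh
# Sketch of a hard filter on GATK INFO annotations (assumed logic, see above).
filter_vcf() {
    awk -F'\t' -v minCov=491 -v bqrs=-8 -v mqrs=-12.5 -v rprs=-8 \
        -v qd=2 -v mq=40 -v fish=60 '
    # true when the annotation is present and below its threshold
    function fails_min(key, thr) { return (key in info) && (info[key] + 0 < thr) }
    /^#/ { print; next }                          # keep header lines
    length($4) != 1 || length($5) != 1 { next }   # drop indels and multi-allelic sites
    {
        split("", info)                           # reset the INFO lookup
        n = split($8, kv, ";")
        for (i = 1; i <= n; i++) { split(kv[i], p, "="); info[p[1]] = p[2] }
        if (fails_min("DP", minCov) || fails_min("BaseQRankSum", bqrs) ||
            fails_min("MQRankSum", mqrs) || fails_min("ReadPosRankSum", rprs) ||
            fails_min("QD", qd) || fails_min("MQ", mq)) next
        if (("FS" in info) && (info["FS"] + 0 > fish)) next
        print
    }' "$1"
}
```

Usage would be filter_vcf tcrExperimentVariants.vcf > filtered.vcf, where filtered.vcf is a placeholder output name.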
(b) then, on filtered1X_tcrExperimentVariants.vcf I ran:
perl vcfFilter2.pl filtered1X_tcrExperimentVariants.vcf
This gets rid of LG0=NA SNPs, those with fewer than 10 non-reference alleles, and those that are nearly fixed (p non-ref > 0.999).
This left me with 6,202,750 SNPs. That is what I will work with, and the results are in lgvar_filtered1X_tcrExperimentVariants.vcf, which I am copying to the project folder, here:
/uufs/chpc.utah.edu/common/home/u6000989/projects/timema_fluct/genomic_change_dark_morph/variants
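For the allele-count part of step (b), the per-site rule can be sketched as below. This assumes the non-reference allele count and frequency are taken from values like the AC and AF annotations in the vcf; vcfFilter2.pl may compute them differently, and the linkage-group check is not covered. The function name keep_site is made up.

```shell
#!/bin/sh
# Hypothetical check for one site: keep it only if the non-reference allele
# count (ac) is at least 10 and the non-reference frequency (af) is <= 0.999.
keep_site() {
    awk -v ac="$1" -v af="$2" 'BEGIN { exit !(ac + 0 >= 10 && af + 0 <= 0.999) }'
}

# A few made-up sites: allele count and frequency
for site in "25 0.025" "5 0.005" "982 1.0"; do
    set -- $site
    if keep_site "$1" "$2"; then
        echo "AC=$1 AF=$2: keep"
    else
        echo "AC=$1 AF=$2: drop"
    fi
done
```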