Post date: Sep 08, 2015 6:10:1 PM
Two things were evident from the initial variant calling: (1) nearly all SNPs were represented by only one or a few sperm, and (2) among the small number of SNPs with data for many sperm (40+), most exhibited 'allele frequencies' far from the value of 0.5 expected at sites that are heterozygous in the donor. The former is perhaps to be expected for sperm sequencing, and we would potentially still be OK if it weren't for the latter. Oxford found similar results when they analyzed these data (for point 1 only; they didn't look at point 2). At that time they also informed me that four of the 'sperm samples' were in fact bulk samples containing many sperm. I am now seeing what happens if I use just these four samples for variant calling (which is roughly like calling variants in the sperm donor). This might help separate true variants from artefacts, and thus let me focus on true variants that hopefully have more reasonable allele frequencies in the 92 individual sperm.
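To give a rough sense of what "far from 0.5" means at a site covered by 40 sperm, here is a minimal sketch of the range expected from sampling noise alone. It assumes each sperm is an independent draw with probability 0.5 of carrying either allele, that there is no genotyping error, and it uses a normal approximation (the 1.96 multiplier). The function name expected_range is made up for illustration.

```shell
#!/bin/sh
# Approximate 95% range of the observed allele frequency at a heterozygous
# site when n sperm are sampled (binomial sampling, p = 0.5).
# Assumptions: independent sperm, no genotyping error, normal approximation.
expected_range() {
    awk -v n="$1" 'BEGIN {
        sd = sqrt(0.25 / n)              # sd of a binomial proportion at p = 0.5
        printf "%.3f %.3f\n", 0.5 - 1.96 * sd, 0.5 + 1.96 * sd
    }'
}

# Example: expected_range 40
```

Under these assumptions, 40 sperm give a range of roughly 0.35 to 0.65, so frequencies well outside that window are hard to explain by sampling alone.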
Here is the command I ran for the bulk-sample variant calling (job_UnifiedGenotyper.sh):
#!/bin/sh
#PBS -N gatkUnifiedGenotyper
#PBS -l nodes=1:ppn=32
#PBS -l walltime=96:00:00
#PBS -l mem=480g
#PBS -q batch
. /rc/tools/utils/dkinit
reuse GATK
cd /labs/evolution/data/timema/sperm/Alignments/
java -jar -Xmx420g -Djava.io.tmpdir=/pscratch/A01963476/ \
    /rc/tools/free/redhat_6_x86_64/gatk-3.1.1/GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R /labs/evolution/data/timema/draft_genome/draft0.3/mod_lg_timemaGenome.fasta \
    -I bamsbulk.list -o sperm_bulkvariants.vcf \
    -nt 32 -glm SNP -hets 0.001 -mbq 20 -ploidy 2 \
    -stand_call_conf 50 -maxAltAlleles 2
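One possible follow-up, once sperm_bulkvariants.vcf has been written, is to restrict the individual-sperm calls to sites that were also called in the bulk samples. The following is only a sketch of that idea, not something I have run. It assumes the per-sperm calls are in a separate VCF (the name sperm_indvariants.vcf in the usage comment is hypothetical), that matching on CHROM and POS alone is sufficient, and that both files are uncompressed. The function name filter_to_bulk_sites is made up for illustration.

```shell
#!/bin/sh
# Keep only the records of a per-sperm VCF whose CHROM:POS also occurs
# in the bulk VCF. Header lines of the per-sperm VCF are passed through.
# Arguments: $1 = bulk VCF, $2 = per-sperm VCF (hypothetical file).
filter_to_bulk_sites() {
    awk -F'\t' '
        FILENAME == ARGV[1] {            # first file: the bulk VCF
            if ($0 !~ /^#/) keep[$1 ":" $2] = 1
            next
        }
        /^#/ { print; next }             # second file: keep header lines
        ($1 ":" $2) in keep              # print records at bulk-called sites
    ' "$1" "$2"
}

# Example:
# filter_to_bulk_sites sperm_bulkvariants.vcf sperm_indvariants.vcf > sperm_bulksites.vcf
```

Matching on position only ignores whether the alternate allele agrees between the two files, so a stricter version would also compare the REF and ALT columns.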