Post date: Dec 10, 2014 9:21:31 PM
All previous ABC scripts are on greenhouse in /Volumes/data/abc.in/
All previous ABC results are on greenhouse in /Volumes/sims/lkl/abc/
Followed notes from 3v12 in red lab notebook.
11Dec14: I ended up copying the following perl scripts and sphapdata to Lycaeides cluster at USU (/labs/evolution/projects/stygoparnus/) because I was crashing greenhouse. To find these files:
ssh A01974358@login.rc.usu.edu
Use same password as USU email.
In /Volumes/data/abc.in/:
1. Changed regular expression in makeHapSeqFil.pl
It was: if(/^S\-P\-CS\-([PAR135SIU]{2})\-(\d+)/){ ###Why is the U in this reg. exp.?
Changed to: if(/^S\-C\-CS\-([PAR12SI]{2})\-(\d+)/){
2. perl makeHapSeqFile.pl /Volumes/data/stygoparnus/assembly/variant_calling/mod_hapseq.txt /Volumes/data/stygoparnus/assembly/variant_calling/mod_hapcnt.txt outfile
mv outfile sphapdata
3. perl getHiIndCovVarHapseq.pl sphapdata 5 ##retains the subset of sequences with data for at least 5 individuals per population. ###Use this script, not getHicovVarHapseq.pl
Produced file called var_sphapdata. Retained 63065 loci with data for 5 inds per population.
I ran this script on USU cluster; to execute one command through compute cluster:
Made a script called sub.sh:
Need to have this stuff on top:
#!/bin/bash
#PBS -N runscript
#PBS -l nodes=1:ppn=1
#PBS -q batch
#PBS -l mem=96g
---
Then this stuff which can be changed for other perl scripts:
cd /labs/evolution/projects/stygoparnus
perl ./getHiIndCovVarHapseq.pl sphapdata 5
Then to execute: qsub sub.sh
To check on status: qstat -u A01974358
To edit scripts in vim:
i edit mode
esc get out of edit mode
:w save
:q quit
4. perl ranSubsetReads.pl var_sphapdata ##This samples 1 read/ind.
#This makes a file called sub_var_sphapdata
5. cntinvariant.pl on sub_var_sphapdata
#This makes a file called var_sub_var_sphapdata
#Found 1885 invariabe sites and 61180 variable sites.
6. Got ZG's help with this: I still need to cut down on the number of loci for Stygoparnus ABC, so we should randomly sample 500 loci at this point to get down to the number we have for the other datasets: 475 loci for Eurycea, 174 loci for Heterelmis, 496 loci for Stygobromus.
-ZG wrote an R script to randomly pick 500 loci out of 61180. He made a text files designating which ones to throw away (0) and which ones to keep (1), called locitokeep.txt.
-Zg wrote a perl script, randomLocusSet.pl, to run with var_sub_varsphapdata to select only those 500 loci. New file is called 500_var_sub_varsphapdata.
7. Ran calcobservedABC.pl on 500_var_sub_var_sphapdata
#Got: coverage_500_var_sub_var_sphapdata, obsSS_500_var_sub_var_sphapdata
8. Added ms to my path.
perl ./sp_popmigABC.pl coverage_500_var_sub_var_sphapdata 500 173 ##500 is the number of sims to run, 173 is the random seed, reminder: prior distributions are the same for each taxon
9. 500 sims took 17 hours. I thought it would take 8 hours based on how long the other taxa took. This means, if I used one processor at a time, it would take 34,000 hours (1416 days) to run 1 million simulations (yikes). We'll be running these in batches -- see #11.
10. See "sp_abcsummarystats.R" for R script for doing these diagnostic tests: See how well parameters are correlating with summary stats (12xii14_sp_parscor.pdf; 12xii14_sp_parscor_pca.pdf)). See where observed summary stats fall within distribution of simulated summary stats (12xii14_sp_ssobs.pdf). See how much summary stats are correlating with each other (12xii14_sp_sscr.pdf). See histograms of the parameters (12xii14_sp_param.pdf).
#All look like they do for the other taxa (diagnostics ran 9v12).
11. Running ABC in batches:
Use wrap_qsub_rc_abc.pl. Only need to change:
my $nsim = 1000
my $nrep = 50
#Runs for 96 hours, so can run as many sims as will fit in 96 hours.
#It will write to a local disk but will automatically zip it up and copy it back to here.
#To submit: perl wrap_qsub_rc_abc.pl coverage_500_var_sub_var_sphapdata 678 #Needs a single random number.
#12Dec14: job #s 166169-166218
Notes:
#catabc.pl concatenates simulations across different runs (need to get rid of the headers except one).
#If simulations stop short of writing all the summary stats, use the script dropShortLines.pl