Stygoparnus variant calling

Post date: Oct 10, 2014 8:0:13 PM

#I handed things over to Kate in May 2014. Here were my instructions to her:

-----

1. There is a new directory on greenhouse for Stygoparnus. See /Volumes/data/stygoparnus/

In this directory is the barcode file I used. Kate, I changed your individual sample names so they match the format of the sample names for the other three Comal Springs datasets (that way, if you use my scripts from the Stygobromus project, you shouldn't have to change regular expressions very much). They look like this: S-C-CS-PA-001. S is for Styogoparnus, C is for comalensis, CS is for Comal Springs, PA is for the subpopulation, and then a unique digit. Here are the subpopulation codes and names:

PA: panther canyon well (9 individuals)

R1: spring run 1 (15 individuals)

R2: spring run 2 (8 individuals)

SI: spring island (21 individuals)

2. In the assembly/ directory are the following files:

stygoparnus_ref.fasta #this is the reference used in the assembly I did on sunflower with xng.

mod_parsed_clean_lan3_Undetermined_R1.cat.fastq #this is the sequence file that has been rid of contaminants, parsed, and whose individuals names have been fixed (with fixnames.pl).

ref_6ix13-0.bam #this is the bam file from the reference-based assembly I did on sunflower with xng.

ref_6ix13-0.bam.bai

ref_6ix13-0.info #There are 74,574,634 reads.

3. Would you pick up with variant calling? I can't remember where you do variant calling. If it's on sunflower, let me know because all my files are there, too.

You can copy necessary scripts from /Volumes/data/UW5-2/assemblies/stygo/variant_calling/

Then, run: remove_multallele.pl, remove_invariant.pl, remove_binom.pl

4. Run getHighCov.pl (also found in Volumes/data/UW5-2/assemblies/stygo/variant_calling/). use N = 5 in the command line followed by all 4 population names.

Use this file downstream: cov5_var_fin_m2_mod_snpcnt.txt

How many SNPs do we have now with "high coverage"? Contigs? What's the mean number of sequences per SNP per individual (coverage)?

5. Then run split_pop.pl on cov5_var_fin_m2_mod_snpcnt.txt

6. Then run all 4 snpfiles on the cluster, 2 chains each to get allele freq / theta estimations. You should get 2000 steps per chain. I used these parameters for the other projects:

alleleEst -l 100000 -g (infile) -e 0 -t 45 -b 10000 -n 1 -u 0.03 -p of_af0snfile_cov5_S-C-CS-PA

8. Average gprobs, make PCA infiles and PCA. See /Volumes/sims/lkl/stygo_CS/afreq_HighCov/ for helpful scripts.

9. From there I can work on pairwise GST, NMDS, and ABC.

--------

#Here are her notes on what she did:

KLB 9vii14

1. modify reference so lines are interleaved

perl /Volumes/data/scripts/fixfastaV2.pl stygoparnus_ref.fasta

2. get rid of line endings in the modified ref

greenhouse:assembly kbell$ perl -pi -e 's/\r//g' mod_stygoparnus_ref.fasta

3. Copy param file and parsebam2 from stygo directory. Modify pathways in parameter file and rename stygoparnus_paramfile.txt

4. run parseBAM2

perl parseBAM2.pl stygoparnus_paramfile.txt &

Started at 7.07pm.

KLB 10vii14

ParseBAM2 did not run properly. No hapcnt, hapseq or snpcnt files were produced.

Will run again and pipe output to a text file so I can figure out what is wrong. Started at 1:13PM.

KLB 28vii14

Zach has fixed the parseBAM2 script, the issue resulted from the fact that the quality scores now contain @ and so the script was unable to distinguish between IDs and quality scores.

Started perl parseBAM2_26July14.pl stygoparnus_paramfile.txt at 1:28pm.

No sequences were found with a phred score higher than or equal to 30. Not sure if this is an issue with the actual phred scores or if something else is going wrong.

KLB 6viii14

Will try re-running parse bam with a low phred score (10) to see what happens.

Moved all the files from the first run of parsebam into a new directory called run 1.

started:

greenhouse:assembly kbell$ perl parseBAM2_26july14.pl stygoparnus_paramfile_run2.txt >outparseBAM_run2.txt

[fai_build_core] different line length in sequence 'ref_contig'.

[1]+ Stopped perl parseBAM2_26july14.pl stygoparnus_paramfile_run2.txt > outparseBAM_run2.txt

greenhouse:assembly kbell$ bg

[1]+ perl parseBAM2_26july14.pl stygoparnus_paramfile_run2.txt > outparseBAM_run2.txt &

at 12:01pm.

Checked the output, again no variable sequences were found even with the low phred score. Seems like something else must be going on.

11viii14 KLB

There was a regular expression issue at line 111 of the parse bam script, this has now been fixed by zach and lauren.

Started run 3 using the new script (and back to a phred score of 30).

perl parseBAM2_08aug14.pl stygoparnus_paramfile_run3.txt > out_parseBAM_run3.txt &

Started at 10:51am.

12viii14 KLB

Checked on parseBAM results, again no variable sequences are found. However, this time the .bam files have been made. Checked file size, they look OK (i.e. not just empty!)

Will try running at a low phred score again incase there is some sort of quality issue. We are running version 1.18 of samtools, could there be an issue with this version using the old set of quality scores?

started run 4 at 12:42

noticed this gets printed to screen:

[fai_build_core] different line length in sequence 'ref_contig'.

Retrieving individual ids from: /Volumes/data/stygoparnus/assembly/mod_parsed_clean_lane3_Undetermined_R1.cat.fastq

Not sure if the different line lengths could be a problem?

18viii14 KLB

running with a phred score of 10 did not help, still no variable sites found.

Will try with a phred score of 0. started run 5 (phred =0) at 3:30pm.

Still no variable sites found when run 5 finished.

KLB 19viii14

Will try using the unmodified reference file. First need to get rid of line endings

perl -pi -e 's/\r//g' stygoparnus_ref.fasta

This fixed the problem.

Updated param file to use the unmodified version of the reference, also changed back the phred score to 30, started run 6 at 9.45am.

Checked how the run was going, the out file where I saved what had been printing to screen has the following message:

Fasta file should have only 60 characters on line or samtools will fail

The run has stopped. Will have to re-think whats going on.

UPDATE: Tried re-making the mod reference file with the stygoparnus_ref.fasta that has fixed line endings.

Updated the paramaterfile and started run7 at 12:15pm.

Checked on how the run was going at 8.08pm, it had stopped, we had run out of space in data, cleared some files up and will re-start. Will go through files tomorrow and organize to clear up some more space.

KLB 20viii14

The repeat of run 7 is finished. It worked! The problem seems to have come from the way in which the modified reference was made. Line endings were present in the unmodified reference, when I ran the fixfasta.pl script they were also present in the modified version of the reference, so I used a the perl one liner to remove them. When I removed the line endings in the unmodified version of the reference, and then ran fixfasta.pl the modified reference now works fine.

I can not carry on with the rest of the protocol and remove the directories from the old runs.

Moved files made during parsebam to /Volumes/data/stygoparnus/assemblies/variant_calling

21viii14 KLB

Added sympolic links to the script files (both general and Laurens).

6. Ran perl scripts_general/remove_multallele.pl hapcnt.txt hapseq.txt snpcnt.txt 2

This produced mod_hapseq mod_hapcnt and mod_snpcnt.

7. perl scripts_general/remove_invariant.pl mod_hapseq.txt mod_hapcnt.txt mod_snpcnt.txt

this produced m2_mod_hapcnt m2_mod_hapseq and m2_mod_snpcnt

8. perl scripts_general/remove_binom.pl m2_mod_snpcnt.txt

Produced fin_m2_mod_snpcnt

25viii14

Changed the regular expresison in getHighcov.pl and ran it.

perl getHighCov.pl fin_m2_mod_snpcnt.txt 5 S-C-CS-PA S-C-CS-R1 S-C-CS-R2 S-C-CS-SI

You kept 191678 loci with 5 or more reads per population

9. Changed regular expression in getPopCov.pl and ran. Produced popcoverage.csv

10. Ran countcov.pl, There are a total of 28,491,061 reads(?), divided this by the number of SNPs (191,678) multiplied by the number of individuals (53). This results in an average of 2.8 reads per individual.

11. head -n 55 stygoparnus_cov5_var_fin_m2_mod_snpcnt.txt

All individuals have been kept up to this point.

Estimating Allele Frequencies and Genotype Probabilities

-------------------

#Then I picked up where Kate left off.

Page updated

Google Sites

Report abuse