CSP
CSP detects chimeric fragments in dilution-based sequencing.
Version 1.0.0 (17/09/13)
---------
Download
---------
CSP can be downloaded from the arrow mark referenced at the bottom of this page.
-----------
Reference
-----------
Matsumoto, H., & Kiryu, H. (2014). Integrating dilution-based sequencing and population genotypes for single individual haplotyping. BMC Genomics, 15(1), 733. [PubMed]
------
Slide
------
------------------
Calculating CSP
------------------
CSP is calculated with two steps.
In the first step, haplotypes probabilities for each SNP fragment region are calculated with the statistical phasing.
We use PHASE [1,2] for the statistical phasing and PHASE have to be installed to calculate CSP.
Usage:
ruby CSP1.rb <Genotype_file> <Fragment_file> <Output_file1> <PHASE_file1> <PHASE_file2> <N> <W>
Example of running CSP1:
ruby CSP1.rb example/genotype.txt example/fragment.txt out/csp1.txt phase/input.txt phase/output.out 11 5
Genotype_file:
This contains population genotypes information.
Format of the file is
<chromosome number> <chromosome position> <refSNP> <base1> <base2> <genotype1> <genotype2> <genotype3> ...
where <genotype(n)> is n-th individual genotype of a SNP.
<genotype1> has to be an individual who is the target of the dilution-based sequencing.
Example of the file is as follows.
1 52066 rs28402963 T C 10 01 01 00 01 00 00 00 10 00 00
1 695745 . G A 10 00 00 00 00 00 00 10 00 00 00
1 766409 rs12124819 A G 01 01 00 00 00 10 01 11 11 00 11
1 801628 . C T 01 00 01 00 00 00 00 00 00 00 00
1 805678 . A T 01 -- -- -- -- -- -- -- -- -- --
1 805716 . A G 01 -- -- -- -- -- -- -- -- -- --
1 806222 . G A 01 00 00 10 10 00 11 01 00 00 10
In our paper, we generated this file from CEU genotypes, which were downloaded from 1000 genomes project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/trio/snps/CEU.trio.2010_03.genotypes.vcf.gz and ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/low_coverage/snps/CEU.low_coverage.2010_07.genotypes.vcf.gz).
Fragment_file:
This contains SNP fragments.
Please see MixSIH page for detailed explanation.
Output_file1:
This contains the haplotypes and these probabilities of the target individual for each SNP fragment.
Because we use sliding-window calculation, a SNP fragment appears many times.
Format of the file is
<SNP fragment name>
<haplotype1_1> , <haplotype1_2> , <probability of haplotype1>
<haplotype2_1> , <haplotype2_2> , <probability of haplotype2>
...
Example of the file is as follows.
frag1
001 , 110 , 0.100
000 , 111 , 0.899
frag2
0001 , 1110 , 1.000
frag3
00000 , 11111 , 1.000
frag3
00000 , 11111 , 1.000
frag3
00001 , 11110 , 0.670
00011 , 11100 , 0.330
PHASE_file1:
This is a temporal file to create input file for PHASE.
PHASE_file2:
This is a prefix of the output files of PHASE.
N:
N is the number of individual genotypes.
W:
W is the sliding-window width.
We use W=5 for default setting.
In the second step, CSP for each SNP fragment are calculated using the results of CSP1.rb.
Usage:
ruby CSP2.rb <Output_file1> <Fragment_file> <Output_file2> <W>
Example of running CSP2:
ruby CSP2.rb out/csp1.txt example/fragment.txt out/csp2.txt 5
Output_file1:
This is the output file of CSP1.rb.
Output_file2:
This contains the CSP values for each SNP fragment.
Format of the file is
<SNP fragment name> <CSP>
[1] Stephens, Matthew, Nicholas J. Smith, and Peter Donnelly. "A new statistical method for haplotype reconstruction from population data." The American Journal of Human Genetics 68.4 (2001): 978-989.
[2] Stephens, Matthew, and Peter Donnelly. "A comparison of bayesian methods for haplotype reconstruction from population genotype data." The American Journal of Human Genetics 73.5 (2003): 1162-1169.