CSP detects chimeric fragments in dilution-based sequencing.

Version 1.0.0 (17/09/13)




CSP can be downloaded from the arrow mark referenced at the bottom of this page.




Matsumoto, H., & Kiryu, H. (2014). Integrating dilution-based sequencing and population genotypes for single individual haplotyping. BMC Genomics, 15(1), 733.





Calculating CSP


CSP is calculated in two steps.

In the first step, haplotype probabilities for each SNP fragment region are calculated by statistical phasing.

We use PHASE [1,2] for the statistical phasing, so PHASE must be installed to calculate CSP.


ruby CSP1.rb <Genotype_file> <Fragment_file> <Output_file1> <PHASE_file1> <PHASE_file2> <N> <W>

Example of running CSP1:

ruby CSP1.rb example/genotype.txt example/fragment.txt out/csp1.txt phase/input.txt phase/output.out 11 5


<Genotype_file> contains population genotype information.

The format of the file is

<chromosome number> <chromosome position> <refSNP> <base1> <base2> <genotype1> <genotype2> <genotype3> ...

where <genotype(n)> is the genotype of the n-th individual at the SNP.

<genotype1> must belong to the individual who is the target of the dilution-based sequencing.

An example of the file is as follows.

1 52066 rs28402963 T C 10 01 01 00 01 00 00 00 10 00 00

1 695745 . G A 10 00 00 00 00 00 00 10 00 00 00

1 766409 rs12124819 A G 01 01 00 00 00 10 01 11 11 00 11

1 801628 . C T 01 00 01 00 00 00 00 00 00 00 00

1 805678 . A T 01 -- -- -- -- -- -- -- -- -- --

1 805716 . A G 01 -- -- -- -- -- -- -- -- -- --

1 806222 . G A 01 00 00 10 10 00 11 01 00 00 10

In our paper, we generated this file from CEU genotypes, which were downloaded from the 1000 Genomes Project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/trio/snps/CEU.trio.2010_03.genotypes.vcf.gz and ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/low_coverage/snps/CEU.low_coverage.2010_07.genotypes.vcf.gz).
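The genotype file layout described above can be read with a few lines of Ruby. This is a minimal sketch, assuming whitespace-separated fields as in the example lines; `GenotypeRecord` and `parse_genotype_line` are hypothetical names and are not part of CSP.

```ruby
# Hypothetical helper, not part of CSP: parse one line of the genotype file.
# Assumes whitespace-separated fields: chromosome, position, refSNP,
# base1, base2, followed by one genotype per individual.
GenotypeRecord = Struct.new(:chromosome, :position, :ref_snp, :base1, :base2, :genotypes)

def parse_genotype_line(line)
  fields = line.split
  raise ArgumentError, "expected at least 6 fields, got #{fields.size}" if fields.size < 6

  chromosome, position, ref_snp, base1, base2, *genotypes = fields
  GenotypeRecord.new(chromosome, Integer(position, 10), ref_snp, base1, base2, genotypes)
end

record = parse_genotype_line("1 52066 rs28402963 T C 10 01 01 00 01 00 00 00 10 00 00")
puts record.genotypes.first  # genotype of the target individual (<genotype1>)
puts record.genotypes.size   # number of genotype columns on this line
```

The first element of `genotypes` corresponds to <genotype1>, the target of the dilution-based sequencing.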


<Fragment_file> contains SNP fragments.

Please see the MixSIH page for a detailed explanation.


<Output_file1> contains the haplotypes of the target individual and their probabilities for each SNP fragment.

Because we use a sliding-window calculation, a SNP fragment appears many times.

The format of the file is

<SNP fragment name>

<haplotype1_1> , <haplotype1_2> , <probability of haplotype1>

<haplotype2_1> , <haplotype2_2> , <probability of haplotype2>


An example of the file is as follows.


001 , 110 , 0.100

000 , 111 , 0.899


0001 , 1110 , 1.000


00000 , 11111 , 1.000


00000 , 11111 , 1.000


00001 , 11110 , 0.670

00011 , 11100 , 0.330


<PHASE_file1> is a temporary file used to create the input file for PHASE.


<PHASE_file2> is the prefix of the output files of PHASE.


N is the number of individual genotypes.
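In the example command N is 11, and each example genotype line has 11 genotype columns. Assuming N is meant to equal the number of genotype columns, it could be derived from the genotype file instead of being counted by hand. `count_individuals` and `LEADING_COLUMNS` are hypothetical names, not part of CSP.

```ruby
# Hypothetical helper, not part of CSP: count genotype columns on one line.
# Assumes five leading columns (chromosome, position, refSNP, base1, base2)
# and whitespace-separated fields, as in the example genotype file.
LEADING_COLUMNS = 5

def count_individuals(line)
  fields = line.split
  raise ArgumentError, "line has no genotype columns" if fields.size <= LEADING_COLUMNS

  fields.size - LEADING_COLUMNS
end

puts count_individuals("1 52066 rs28402963 T C 10 01 01 00 01 00 00 00 10 00 00")  # prints 11
```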


W is the sliding-window width.

We use W=5 as the default setting.

In the second step, the CSP for each SNP fragment is calculated using the results of CSP1.rb.


ruby CSP2.rb <Output_file1> <Fragment_file> <Output_file2> <W>

Example of running CSP2:

ruby CSP2.rb out/csp1.txt example/fragment.txt out/csp2.txt 5
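The two steps can be chained from a small driver script. The following is a sketch only: `csp_commands` and `run_csp` are hypothetical helpers, the script is assumed to run from the directory containing CSP1.rb and CSP2.rb, and PHASE is assumed to be installed already. By default it only prints the commands (dry run).

```ruby
# Hypothetical wrapper, not part of CSP: build the two command lines in order.
# The same W is passed to both steps, as in the examples.
def csp_commands(genotype:, fragment:, out1:, out2:, phase_in:, phase_out:, n:, w: 5)
  [
    ["ruby", "CSP1.rb", genotype, fragment, out1, phase_in, phase_out, n.to_s, w.to_s],
    ["ruby", "CSP2.rb", out1, fragment, out2, w.to_s]
  ]
end

def run_csp(dry_run: true, **opts)
  csp_commands(**opts).each do |argv|
    puts argv.join(" ")
    next if dry_run

    # system returns true only when the command exits with status 0.
    system(*argv) or raise "command failed: #{argv.join(' ')}"
  end
end

run_csp(
  genotype: "example/genotype.txt", fragment: "example/fragment.txt",
  out1: "out/csp1.txt", out2: "out/csp2.txt",
  phase_in: "phase/input.txt", phase_out: "phase/output.out", n: 11
)
```

Passing `dry_run: false` would execute the commands and stop at the first one that fails, so CSP2.rb is not run on an incomplete <Output_file1>.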


<Output_file1> is the output file of CSP1.rb.


<Output_file2> contains the CSP values for each SNP fragment.

The format of the file is

<SNP fragment name> <CSP>
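The <Output_file2> layout can be read as name/value pairs. This is a minimal sketch, assuming the two fields are separated by whitespace and that fragment names contain no spaces; `parse_csp2_output` is a hypothetical helper, and the names and values in the sample are made up.

```ruby
# Hypothetical helper, not part of CSP: read "<SNP fragment name> <CSP>" lines.
# Returns an array of [name, csp] pairs in file order.
def parse_csp2_output(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    name, csp = line.split
    [name, Float(csp)]
  end
end

# Made-up names and values, for illustration only.
sample = "fragA 0.12\nfragB 0.87\n"

# List fragments from the highest CSP value to the lowest.
parse_csp2_output(sample).sort_by { |_, csp| -csp }.each do |name, csp|
  puts format("%-8s %.3f", name, csp)
end
```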

[1] Stephens, Matthew, Nicholas J. Smith, and Peter Donnelly. "A new statistical method for haplotype reconstruction from population data." The American Journal of Human Genetics 68.4 (2001): 978-989.

[2] Stephens, Matthew, and Peter Donnelly. "A comparison of bayesian methods for haplotype reconstruction from population genotype data." The American Journal of Human Genetics 73.5 (2003): 1162-1169.