created by shlee
on 2016-06-22
This tutorial shows how to generate simulated reads against a specific target sequence. This can be useful, e.g. if you want to simulate reads for an alternate contig in GRCh38/hg38 to see how they end up mapping between the primary assembly versus the alternate contig.
We use external tools to accomplish this. In Section 1, we use Samtools to subset the target contig sequence from a reference FASTA file. In Section 2, we use wgsim to generate FASTQ format paired reads against the target contig. The resulting read data is ready for alignment.
This tutorial provides example data for you to follow along and includes a mini-reference FASTA. If you are unfamiliar with terms that describe reference genome components, take a few minutes to study the Dictionary entry Reference Genome Components.
This tutorial uses external tools that may require additional dependencies, e.g. the gcc compiler, that may not be available by default on your system.
chr19_chr19_KI270866v1_alt.fasta and corresponding .dict dictionary and .fai index, the subset chr19_KI270866v1_alt.fasta and final 7859_GPI.read1.fq and 7859_GPI.read2.fq FASTQ files.chr19 and chr19_KI270866v1_alt. In the tutorial, we simulate reads for the 43,156 bp ALT contig. The ALT contig corresponds to a diverged haplotype of chromosome 19. Specifically, it corresponds to chr19:34350807-34392977, which contains the glucose-6-phosphate isomerase or GPI gene. Part of the ALT contig introduces novel sequence that lacks a corresponding region in the primary assembly.Each contig in the reference FASTA has a header line beginning with > that identifies the contig sequence that follows. We need the exact representation of this header line to subset the target contig sequence. The UNIX command below lists all such headers for the FASTA file.
grep '>' chr19_chr19_KI270866v1_alt.fasta
This prints the following for our mini-reference chr19_chr19_KI270866v1_alt.fasta. ````
Use the faidx option of Samtools to subset the ALT contig sequence to a new FASTA file, chr19_KI270866v1_alt.fasta.
samtools faidx chr19_chr19_KI270866v1_alt.fasta chr19_KI270866v1_alt > chr19_KI270866v1_alt.fasta
To introduce variants into reads, alter the FASTA sequence at this point before simulating reads. For example, to introduce a simple heterozygous SNP, duplicate the contig information within the file, name the duplicate contig differently, and change the base within the duplicated sequence. Search for the target base's sequence context by using TextEdit's Find function. Keep in mind FASTA file sequences contain line breaks.
To generate an alternate FASTA reference based on a VCF of variants, see GATK’s FastaAlternateReferenceMaker.
Generate simulated reads from chr19_KI270866v1_alt.fasta with the following command.
wgsim -1151 -2151 -d500 -r0 -e0 -N10000 -R0 -X0 chr19_KI270866v1_alt.fasta 7859_GPI.read1.fq 7859_GPI.read2.fq
This gives two FASTQ files, 7859_GPI.read1.fq and 7859_GPI.read2.fq, one for each mate of the paired reads.
-1151 and -2151 for read1 and read2, respectively.-d500 parameter.-N10000 parameter.-R0 & -X0) nor mutations/variants (-r0).-e so we set it to zero (-e0) for base quality scores of I, which is, according to this page and this site, an excellent base quality score equivalent to a Sanger Phred+33 score of 40.For a 43 kb contig, 10K x 2 x 151 reads should give us ~70x hypothetical coverage. Here are two pairs of reads from 7859_GPI.read1.fq and 7859_GPI.read2.fq.
7859_GPI.read1.fq
@chr19_KI270866v1_alt_40173_40622_0:0:0_0:0:0_0/1 AGGTATGAGGATCTGGGTCTTCCCGTGTCTGAGTAGGTAGCACCTGGCACAGGTATGAGGATATGGGTCTTCCATGTCTGAGGAGGTAGCACCTGGCACAGATATGAGGATCTGCGTCTTCCAGTGTTTGAGGAGGTGAGTTTGGACTCAG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @chr19_KI270866v1_alt_30797_31341_0:0:0_0:0:0_1/1 CACCACTGCTGAGCTCAGGCAAGTGCACAAGGAAAGCTGTGGCTCACTGCTCGGCTCCAGCAGAGGTGGTCCCATGGACCACCTGTTGCTACAGAGGGGTCGGCAGCCCTGTCACTCAAGGCAGGGTTTGCTCTGCAAGCTGCCCCAGCCT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
7859_GPI.read2.fq
@chr19_KI270866v1_alt_40173_40622_0:0:0_0:0:0_0/2 AGGGCCAGATCACACCTCCTCAGATATTGACCGACCCAGATCCTTATACCTGCACCAGATCCTACCTCCTCAGGCATTGACAGATCCAGATCCTTATACTTGTGCCAGATCCTACCTCCTTAGACATGGACAGACCCAGATCCTCATACCA + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @chr19_KI270866v1_alt_30797_31341_0:0:0_0:0:0_1/2 AGGCCCATGAGGTCAGGTCAGTGTTTATTGAGTACCTGCTGCATACCTAGCTTGGGGAAAGGTAGAGAGGCCCTCAGAGAGGCTTGGAGGGCAAGAGCAACCCAGGCAGGATGAGGGCTCCACTTCCACCTGAGGGCGGGCTGAGCTTGCA + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
All the bases of all the reads from a simulation have the same base quality, and in this instance each base quality is I. Notice the read names of the simulated reads contain useful information, e.g the last read name @chr19_KI270866v1_alt_30797_31341_0:0:0_0:0:0_1/2 consists of the following.
chr19_KI270866v1_alt30797_313410:0:0 and the same for the mate 0:0:01/2Updated on 2016-12-22
chr19 chr19KI270866v1alt ````