created by shlee
on 2016-06-22
This tutorial shows how to generate simulated reads against a specific target sequence. This can be useful, e.g. if you want to simulate reads for an alternate contig in GRCh38/hg38 to see how they end up mapping between the primary assembly versus the alternate contig.
We use external tools to accomplish this. In Section 1, we use Samtools to subset the target contig sequence from a reference FASTA file. In Section 2, we use wgsim to generate FASTQ format paired reads against the target contig. The resulting read data is ready for alignment.
This tutorial provides example data for you to follow along and includes a mini-reference FASTA. If you are unfamiliar with terms that describe reference genome components, take a few minutes to study the Dictionary entry Reference Genome Components.
This tutorial uses external tools that may require additional dependencies, e.g. the gcc compiler, that may not be available by default on your system.
chr19_chr19_KI270866v1_alt.fasta
and corresponding .dict
dictionary and .fai
index, the subset chr19_KI270866v1_alt.fasta
and final 7859_GPI.read1.fq
and 7859_GPI.read2.fq
FASTQ files.chr19
and chr19_KI270866v1_alt
. In the tutorial, we simulate reads for the 43,156 bp ALT contig. The ALT contig corresponds to a diverged haplotype of chromosome 19. Specifically, it corresponds to chr19:34350807-34392977, which contains the glucose-6-phosphate isomerase or GPI gene. Part of the ALT contig introduces novel sequence that lacks a corresponding region in the primary assembly.Each contig in the reference FASTA has a header line beginning with >
that identifies the contig sequence that follows. We need the exact representation of this header line to subset the target contig sequence. The UNIX command below lists all such headers for the FASTA file.
grep '>' chr19_chr19_KI270866v1_alt.fasta
This prints the following for our mini-reference chr19_chr19_KI270866v1_alt.fasta
. ````
Use the faidx
option of Samtools to subset the ALT contig sequence to a new FASTA file, chr19_KI270866v1_alt.fasta
.
samtools faidx chr19_chr19_KI270866v1_alt.fasta chr19_KI270866v1_alt > chr19_KI270866v1_alt.fasta
To introduce variants into reads, alter the FASTA sequence at this point before simulating reads. For example, to introduce a simple heterozygous SNP, duplicate the contig information within the file, name the duplicate contig differently, and change the base within the duplicated sequence. Search for the target base's sequence context by using TextEdit's Find function. Keep in mind FASTA file sequences contain line breaks.
To generate an alternate FASTA reference based on a VCF of variants, see GATK’s FastaAlternateReferenceMaker.
Generate simulated reads from chr19_KI270866v1_alt.fasta
with the following command.
wgsim -1151 -2151 -d500 -r0 -e0 -N10000 -R0 -X0 chr19_KI270866v1_alt.fasta 7859_GPI.read1.fq 7859_GPI.read2.fq
This gives two FASTQ files, 7859_GPI.read1.fq
and 7859_GPI.read2.fq
, one for each mate of the paired reads.
-1151
and -2151
for read1 and read2, respectively.-d500
parameter.-N10000
parameter.-R0
& -X0
) nor mutations/variants (-r0
).-e
so we set it to zero (-e0
) for base quality scores of I
, which is, according to this page and this site, an excellent base quality score equivalent to a Sanger Phred+33 score of 40.For a 43 kb contig, 10K x 2 x 151 reads should give us ~70x hypothetical coverage. Here are two pairs of reads from 7859_GPI.read1.fq
and 7859_GPI.read2.fq
.
7859_GPI.read1.fq
@chr19_KI270866v1_alt_40173_40622_0:0:0_0:0:0_0/1 AGGTATGAGGATCTGGGTCTTCCCGTGTCTGAGTAGGTAGCACCTGGCACAGGTATGAGGATATGGGTCTTCCATGTCTGAGGAGGTAGCACCTGGCACAGATATGAGGATCTGCGTCTTCCAGTGTTTGAGGAGGTGAGTTTGGACTCAG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @chr19_KI270866v1_alt_30797_31341_0:0:0_0:0:0_1/1 CACCACTGCTGAGCTCAGGCAAGTGCACAAGGAAAGCTGTGGCTCACTGCTCGGCTCCAGCAGAGGTGGTCCCATGGACCACCTGTTGCTACAGAGGGGTCGGCAGCCCTGTCACTCAAGGCAGGGTTTGCTCTGCAAGCTGCCCCAGCCT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
7859_GPI.read2.fq
@chr19_KI270866v1_alt_40173_40622_0:0:0_0:0:0_0/2 AGGGCCAGATCACACCTCCTCAGATATTGACCGACCCAGATCCTTATACCTGCACCAGATCCTACCTCCTCAGGCATTGACAGATCCAGATCCTTATACTTGTGCCAGATCCTACCTCCTTAGACATGGACAGACCCAGATCCTCATACCA + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII @chr19_KI270866v1_alt_30797_31341_0:0:0_0:0:0_1/2 AGGCCCATGAGGTCAGGTCAGTGTTTATTGAGTACCTGCTGCATACCTAGCTTGGGGAAAGGTAGAGAGGCCCTCAGAGAGGCTTGGAGGGCAAGAGCAACCCAGGCAGGATGAGGGCTCCACTTCCACCTGAGGGCGGGCTGAGCTTGCA + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
All the bases of all the reads from a simulation have the same base quality, and in this instance each base quality is I
. Notice the read names of the simulated reads contain useful information, e.g the last read name @chr19_KI270866v1_alt_30797_31341_0:0:0_0:0:0_1/2
consists of the following.
chr19_KI270866v1_alt
30797_31341
0:0:0
and the same for the mate 0:0:0
1/2
Updated on 2016-12-22
chr19 chr19KI270866v1alt ````