SeqSelector: Tools to select sequences for capture enrichment of next-gen libraries
Whole genome sequences are becoming increasingly available for both model and non-model species. Even when incomplete and lacking functional annotations, they still provide a wealth of biological information and can serve as useful resources for selecting sequences for capture enrichment and subsequent re-sequencing. Here we describe a workflow and a user-friendly, platform-independent toolset to facilitate selection of sequences for capture enrichment. The interactive scripts require no knowledge of programming and can be applied to any genome sequence. The suggested workflow begins with the identification of candidate loci, e.g. by utilizing publicly available functional gene annotation datasets or other published data. The corresponding sequences are then selected from the reference genome and used to query the unannotated genome of a non-model species with a BLAST search, thus providing targeted sequences for bait design and library enrichment. We demonstrate the performance of our toolset by comparing the exon sequences obtained through the full SeqSelector workflow to those obtained using annotation information from the same subject genome. We found that >96% of the selected sequences were from the correct gene for species that diverged from the reference species up to 15 million years ago. The sequences that did not match arose primarily from differences in the annotations between the reference and subject genomes, and thus do not result from limitations of the workflow itself. Overall, the SeqSelector toolset and workflow provide an accurate, efficient, and user-friendly method for selecting regions for capture enrichment across the genomes of model and non-model species.
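To make the workflow concrete, the sketch below illustrates two of the steps described above: pulling candidate loci out of a reference FASTA file, and keeping the best BLAST hit per query from tabular output (the standard 12-column `-outfmt 6` format). This is not SeqSelector's own code. The function names (`read_fasta`, `select_sequences`, `best_hits`), the sample sequences, and the e-value cutoff of 1e-10 are all assumptions chosen for illustration.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6, default fields).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_fasta(handle):
    """Parse FASTA lines into {sequence_id: sequence}.

    The ID is taken as the first word of each header line.
    """
    parts = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def select_sequences(records, wanted_ids):
    """Keep only the candidate loci, in the order of the candidate list."""
    return {name: records[name] for name in wanted_ids if name in records}


def best_hits(blast_lines, max_evalue=1e-10):
    """Return the highest-bitscore hit per query.

    max_evalue is a hypothetical cutoff, not a SeqSelector default.
    """
    best = {}
    reader = csv.DictReader(blast_lines, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["evalue"]) > max_evalue:
            continue
        score = float(row["bitscore"])
        query = row["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (row["sseqid"], score, int(row["sstart"]), int(row["send"]))
    return best


# Made-up reference sequences and BLAST rows, for demonstration only.
reference = io.StringIO(
    ">geneA exon1\nATGGCC\nTTAA\n>geneB exon1\nGGGCCC\n>geneC exon1\nTTTT\n"
)
queries = select_sequences(read_fasta(reference), ["geneC", "geneA", "geneX"])

blast_output = io.StringIO(
    "geneA\tscaffold_7\t95.0\t10\t0\t0\t1\t10\t501\t510\t1e-20\t50.0\n"
    "geneA\tscaffold_2\t90.0\t10\t1\t0\t1\t10\t33\t42\t1e-15\t40.0\n"
    "geneC\tscaffold_9\t80.0\t4\t0\t0\t1\t4\t8\t5\t0.5\t9.0\n"
)
targets = best_hits(blast_output)
```

In this toy run, `queries` holds the sequences for geneC and geneA (geneX is absent from the reference and is skipped), and `targets` keeps only the scaffold_7 hit for geneA, since the geneC hit fails the e-value cutoff. The subject coordinates in each retained hit are what one would pass on to bait design.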
Download the program, documentation, and examples at https://sourceforge.net/projects/seqselector/files/SeqSelector1.1b/
Update! 26 March 2015 - Apologies for the delay, but I will be releasing some new features soon. Feel free to email me for a beta version, or stay tuned for more...
Figure 1. Overview of the SeqSelector workflow demonstrating major steps and associated tools. Gray shading represents steps using annotated genome sequences (the ‘working reference’) or additional published and unpublished data. Blue text and dashed arrows indicate steps that can be used to obtain sequences of genes of interest from the reference species genome, while green text and dotted arrows indicate the starting point when EST or transcriptome sequences are available. Gray dotted arrows indicate an optional step.