Scaffolding and detecting structural variants with MetaCarvel

Large-scale microbiome studies characterizing complex microbial communities increasingly demonstrate the importance of the microorganisms around us. By reconstructing or assembling the genomes of different microbes within a population, we can extract gene content and identify genomic variation that may alter microbial functional or pathogenic capacities and contribute to human and ecological health and disease.

Detection of variants in metagenomes is challenging. The composition and abundance of genomes within samples are typically unknown, and many variant detection methods are reference-based, preventing the characterization of variation within “microbial dark matter” (sequences of unknown lineage and function). Most existing assemblers contain dedicated scaffolding modules that operate on high-confidence contigs (unitigs) generated from assembly graphs and output scaffolds as a set of sequences, discarding information about how the scaffolds were constructed. Sequences corresponding to different strains are often broken into different scaffolds or collapsed within single contigs. As a result, we lose significant information about biological phenomena like gene loss, gain, or transfer. Furthermore, many existing assemblers do not perform variant detection because they apply conservative parameters that confound true variants with sequencing errors, which are “smoothed out” of the graph. Unlike reference- or marker gene-based methods, assembly graph exploration provides insight into variation within novel microbes. However, graph-based variant identification has thus far been limited to detecting multi-allelic variants at particular genomic loci.

To address these challenges, we developed MetaCarvel [1], a method for de novo scaffolding and reference-free variant discovery in whole metagenomic shotgun sequencing datasets. MetaCarvel identifies five biologically important motifs in assembly graphs. Bubbles manifest when contigs (nodes in an assembly graph) diverge into two or more paths before converging. Triangular bubbles, where one path contains only the source and sink contigs of the separation pairs, represent insertion/deletion (indel) events between genomes. Bubbles with exactly two paths between the source and sink contigs of the separation pairs represent simple strain variants (SSV), while bubbles with more than two paths and at least three contigs per path represent complex strain variants (CSV). Similarly, simple cycles in which a single contig is adjacently repeated denote tandem repeat sequences, while multi-node cycles represent circular plasmids. High-centrality nodes, where a single contig is flanked by multiple different contigs, typify interspersed repeats found throughout a genome or segments shared by multiple organisms in a sample.
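The bubble and tandem-repeat definitions above can be illustrated with a small Python sketch. This is not MetaCarvel's implementation: the adjacency-list graph format, the function names (`simple_paths`, `classify_bubble`, `tandem_repeats`), and the brute-force path enumeration are all assumptions made for illustration, and the source and sink contigs are assumed to be already known. Path length here counts the source and sink contigs, which is one possible reading of "contigs per path".

```python
from typing import Dict, List, Optional

# Hypothetical graph format: contig name -> list of successor contigs.
Graph = Dict[str, List[str]]


def simple_paths(graph: Graph, source: str, sink: str,
                 prefix: Optional[List[str]] = None) -> List[List[str]]:
    """Enumerate all simple (non-repeating) paths from source to sink."""
    path = (prefix or []) + [source]
    if source == sink:
        return [path]
    paths: List[List[str]] = []
    for nxt in graph.get(source, []):
        if nxt not in path:  # skip self-loops and cycles
            paths.extend(simple_paths(graph, nxt, sink, path))
    return paths


def classify_bubble(graph: Graph, source: str, sink: str) -> str:
    """Label a source/sink pair using the motif definitions in the text."""
    paths = simple_paths(graph, source, sink)
    if len(paths) < 2:
        return "not a bubble"
    # Triangular bubble: one path holds only the source and sink contigs.
    if any(len(p) == 2 for p in paths):
        return "indel"
    # Every remaining path has at least three contigs (source, >=1 inner, sink).
    if len(paths) == 2:
        return "SSV"
    return "CSV"


def tandem_repeats(graph: Graph) -> List[str]:
    """Contigs with an edge to themselves (single-contig cycles)."""
    return [node for node, succs in graph.items() if node in succs]


if __name__ == "__main__":
    indel = {"A": ["B", "C"], "B": ["C"]}
    ssv = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
    print(classify_bubble(indel, "A", "C"))
    print(classify_bubble(ssv, "A", "D"))
```

Enumerating every simple path is exponential in the worst case, so this approach only suits toy graphs; it is meant to make the definitions concrete rather than to scale to real metagenomic assemblies.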

We used MetaCarvel to scaffold the assemblies of 934 Human Microbiome Project (HMP) [2,3] samples from six different body sites. The scaffolds generated by MetaCarvel significantly increased the contiguity of the original HMP assemblies [1], and MetaCarvel identified over nine million variants within the samples.

Compared to alignment-based methods that can only identify SNVs and other small variants, assembly graph-based variant detection can resolve complex functional and structural changes in mixed microbial communities. The variant detection approach used by MetaCarvel is thus able to capture important biological signals that are not apparent within the assembled sequence or by comparison to a reference.

Please check out our presentation from ISMB 2020 to learn more!

References

1. Ghurye, J., Treangen, T., Fedarko, M., Hervey, W. J., 4th & Pop, M. MetaCarvel: linking assembly graph motifs to biological variants. Genome Biol. 20, 174 (2019).

2. The Human Microbiome Project Consortium. A framework for human microbiome research. Nature 486, 215–221 (2012).

3. The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).