Paper Discussion

plasmidSPAdes: assembling plasmids from whole genome sequencing data. Antipov, D., Hartwick, N., Shen, M., Raiko, M., Lapidus, A. and Pevzner, P.A., 2016.

Introduction

Plasmids are small, double-stranded, extrachromosomal DNA molecules that are separate from the bacterial chromosome. They replicate independently using the host machinery and sometimes carry genes for traits such as antibiotic resistance. Plasmids are also used as vectors that carry foreign genetic sequences into another cell. On the bioinformatics side, few tools can accurately extract plasmids from whole genome sequencing (WGS) data. When a bacterial genome is assembled without prior plasmid isolation, recovering plasmid information is difficult because it is hard to know which contigs originate from a plasmid.

PlasmidFinder is a tool for detecting plasmids, but it searches only against a reference set of known plasmid sequences, which makes it difficult to recover novel plasmids. Another tool, PLAsmid Constellation NETwork (PLACNET), combines information from scaffolding links, comparison against a list of plasmid references, and detection of replication initiator proteins.

In this paper, the authors reconstruct plasmids from the WGS assembly alone, using the information encoded in the de Bruijn graph. They build on the SPAdes assembler, which is known to significantly improve plasmid assembly, to assemble plasmids from WGS data. plasmidSPAdes is used to analyze the C. freundii CFNIH1 genome, which has well-annotated plasmids, and to identify novel plasmids (7 in this paper). The algorithm re-assembles the genomes, identifies their plasmids, and supplements NCBI with the corresponding plasmid annotations.

Method

plasmidSPAdes uses the de Bruijn graph generated by SPAdes. Plasmid coverage can be extremely low when the plasmid is present in only a few cells, or extremely high when it has a large copy number. The median coverage is therefore chosen as the summary statistic rather than the mean, since the median is less affected by these extremes. The edges in the graph are classified as either short or long based on a set contig length parameter. Using long contigs helps exclude repeat sequences, and long contigs also have a low variance in coverage. A long edge is then classified as chromosomal if its coverage is close to the median coverage.
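The coverage-based classification can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Edge` type, the function names, the length-weighted form of the median, the rule "within a relative deviation of the median", and the default values of `long_threshold` and `max_deviation` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    name: str
    length: int      # edge length in base pairs
    coverage: float  # average coverage of the edge


def median_coverage(edges, long_threshold):
    """Length-weighted median coverage, computed over long edges only."""
    long_edges = sorted(
        (e for e in edges if e.length >= long_threshold),
        key=lambda e: e.coverage,
    )
    if not long_edges:
        raise ValueError("no long edges to estimate coverage from")
    half = sum(e.length for e in long_edges) / 2
    running = 0
    for e in long_edges:
        running += e.length
        if running >= half:
            return e.coverage


def classify(edges, long_threshold=10_000, max_deviation=0.3):
    """Label each edge as 'short', 'chromosomal', or 'candidate'.

    Assumed rule: a long edge is chromosomal when its coverage lies
    within max_deviation (relative) of the median coverage.
    """
    med = median_coverage(edges, long_threshold)
    labels = {}
    for e in edges:
        if e.length < long_threshold:
            labels[e.name] = "short"
        elif abs(e.coverage - med) <= max_deviation * med:
            labels[e.name] = "chromosomal"
        else:
            labels[e.name] = "candidate"
    return labels


# Hypothetical toy graph: two chromosomal edges, one high-copy edge, one repeat.
example_edges = [
    Edge("chr1", 50_000, 30.0),
    Edge("chr2", 40_000, 32.0),
    Edge("high_copy", 12_000, 150.0),
    Edge("repeat", 500, 90.0),
]
example_labels = classify(example_edges)
```

In this toy input the two long edges near the median are labeled chromosomal, the high-coverage long edge stays a plasmid candidate, and the short edge is left unclassified by coverage.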

The algorithm then iteratively removes long chromosomal edges and dead-end edges (since plasmids are expected to be circular), unless the long edge is part of a large connected component with no dead-end edges. Non-branching paths are then replaced with single edges before the next iteration. When there are no more edges to remove, plasmidic components are selected as either single loop edges of a set length or connected components of a set size. ExSPAnder is then used to perform repeat resolution on the graph, and finally the plasmid contigs are grouped into connected components. The final output of the algorithm is a plasmid graph that contains no chromosomal contigs.
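The pruning loop can be illustrated with a simplified Python sketch. It is not the authors' code: it removes the chromosomal edges once and then repeatedly strips dead-end edges until none remain, and it deliberately omits the connected-component exception and the compression of non-branching paths. The graph representation (a dict from edge name to a source/target vertex pair) and the definition of a dead end used here are assumptions for the example.

```python
def prune(edges, chromosomal):
    """Remove chromosomal edges, then iteratively remove dead-end edges.

    edges: dict mapping edge name -> (source_vertex, target_vertex)
    chromosomal: set of edge names already classified as chromosomal

    Assumed definition: an edge is a dead end if no remaining edge
    enters its source vertex or no remaining edge leaves its target
    vertex. Edges on a cycle (including self-loops) are never dead ends.
    """
    remaining = {n: st for n, st in edges.items() if n not in chromosomal}
    while True:
        has_out = {s for s, _ in remaining.values()}
        has_in = {t for _, t in remaining.values()}
        dead = [
            n for n, (s, t) in remaining.items()
            if s not in has_in or t not in has_out
        ]
        if not dead:
            return remaining
        for n in dead:
            del remaining[n]


# Hypothetical toy graph: a chromosomal cycle, a three-edge tip hanging
# off it, a self-loop plasmid, and a two-edge circular plasmid.
toy_edges = {
    "c1": ("A", "B"), "c2": ("B", "C"), "c3": ("C", "A"),
    "t1": ("C", "D"), "t2": ("D", "E"), "t3": ("E", "F"),
    "p1": ("P", "P"),
    "q1": ("X", "Y"), "q2": ("Y", "X"),
}
survivors = prune(toy_edges, chromosomal={"c1", "c2", "c3"})
```

The tip needs two passes to disappear (its middle edge only becomes a dead end after its neighbors are removed), which is why the removal has to be iterated. Only the circular components survive.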

Results

One of the contributions of this paper is, to paraphrase the authors, a first analysis of the accuracy of a plasmid reconstruction method over a diverse set of bacterial genomes. They selected 6 bacterial genomes with annotated plasmids and 10 whose plasmids had yet to be annotated.

After running their approach on these 16 genomes, they took each sequence it identified and used BLAST to find the best hit (defined by the lowest E-value) in a database containing the longest contig from putative plasmids. If the best hit was a confirmed plasmid, they scored this as 'Y'. Not every sequence could be confirmed this way. For the remaining sequences, they used the RAST annotation system to check whether the sequences contained plasmid-specific proteins. These were labeled 'NA', to signify potentially new plasmids identified by their method. Some of the identified sequences are likely phage or mobile element sequences, and were given a corresponding label.
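The best-hit step can be sketched as follows, assuming the BLAST results are available in the standard 12-column tabular format (query ID in the first column, subject ID in the second, E-value in the eleventh). The function names, the `confirmed_plasmids` set, and the "unresolved" placeholder label are hypothetical. The sketch covers only the 'Y' decision, not the RAST follow-up.

```python
def best_hits(blast_tab_lines):
    """Return {query: (subject, evalue)} keeping the lowest E-value per query."""
    best = {}
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def label_hits(best, confirmed_plasmids):
    """Label a query 'Y' when its best hit is a confirmed plasmid.

    Everything else is marked 'unresolved' (a placeholder, not a label
    from the paper) and would need further checks.
    """
    return {
        query: "Y" if subject in confirmed_plasmids else "unresolved"
        for query, (subject, _) in best.items()
    }


# Hypothetical BLAST rows for two assembled components.
rows = [
    "comp1\tpA\t99.0\t1000\t1\t0\t1\t1000\t1\t1000\t1e-50\t1800",
    "comp1\tpB\t99.5\t1200\t1\t0\t1\t1200\t1\t1200\t1e-80\t2100",
    "comp2\tseqX\t85.0\t300\t40\t2\t1\t300\t1\t300\t0.001\t50",
]
labels = label_hits(best_hits(rows), confirmed_plasmids={"pB"})
```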

Our Analysis

This paper addresses an interesting problem. Plasmids are important genetic elements that are often associated with virulence factors and antibiotic resistance. The paper puts forth a novel approach for plasmid assembly: classify the contigs of a de Bruijn graph, then filter out the chromosomal contigs, leaving the plasmid sequences. However, the algorithm assumes that chromosomal contigs have a small variance in coverage and that plasmid coverage is either much higher or much lower. The authors themselves show that this is not always the case. Coverage may be elevated near the origin of replication in cells that are actively replicating, so the variance in chromosomal coverage is much larger in these cases, and the method cannot catch all chromosomal contigs without risking the loss of true positives. The method also struggles to find plasmids whose coverage is in the same range as the chromosomal coverage.

We also didn't feel that they benchmarked and validated their algorithm appropriately. For one, they included an unannotated dataset, and without a ground truth it is difficult to accept all of the plasmids they claim to have found. Some were believable, but others felt like speculation. Secondly, they did not benchmark their method against any other method (e.g. PlasmidFinder and PLACNET). Therefore, it is difficult to determine how well their algorithm actually works compared to what is already available.

We are not sure if we want to build our project off of this paper. If we do choose to use it, one possible route is to build on the model in order to address some of the shortcomings and challenges of the current algorithm.