16 - Dealing with different library sizes after high throughput metabarcode sequencing
How you deal with variable sequencing library sizes depends on which objective is being addressed: 1) normalization prior to alpha/beta diversity analyses, or 2) normalization prior to differential abundance analysis (DAA).
Whether or not you should treat read counts as an indicator of abundance (biomass/density) is up for debate. That question, along with how to derive biological meaning from read counts given the reality of DNA extraction bias, primer bias, and mixed-template PCR, will not be considered in this post.
Here, I quickly review some of the recent literature that deals generally with library size normalization.
Normalization prior to alpha/beta diversity analyses:
Simply put, alpha and beta diversity measures are strongly influenced by sequencing depth. Normalizing variable library sizes helps us make fair comparisons and avoid false positives (detecting differences among samples that are due solely to sequencing depth). In a nutshell, Weiss et al., 2017 advocate for the continued use of simple rarefaction. In this paper, they rarefy to the lowest 15th percentile library size. Compared with rarefying down to the smallest library size, this keeps more reads in each retained sample, although samples that fall below the chosen depth are dropped.
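To make the rarefaction step concrete, here is a minimal sketch in Python with NumPy. It is not the code used by Weiss et al.; the function name `rarefy_to_percentile`, the samples-by-taxa layout of the count matrix, and the choice to round the depth down are all my own assumptions for illustration.

```python
import numpy as np

def rarefy_to_percentile(counts, percentile=15, seed=0):
    """Rarefy a samples x taxa count matrix to a percentile of library size.

    Hypothetical helper for illustration only.
    Returns (rarefied_counts, kept_row_indices, depth).
    """
    counts = np.asarray(counts, dtype=np.int64)
    rng = np.random.default_rng(seed)
    lib_sizes = counts.sum(axis=1)
    # Target depth: the chosen percentile of library sizes, rounded down.
    depth = int(np.floor(np.percentile(lib_sizes, percentile)))
    # Samples shallower than the target cannot be rarefied to it, so drop them.
    kept = np.flatnonzero(lib_sizes >= depth)
    # Draw `depth` reads per retained sample without replacement.
    rarefied = np.vstack([
        rng.multivariate_hypergeometric(counts[i], depth) for i in kept
    ])
    return rarefied, kept, depth

# Toy data: four samples with library sizes 15, 200, 2, and 100.
toy = np.array([
    [10, 0, 5],
    [100, 50, 50],
    [1, 1, 0],
    [30, 30, 40],
])
rarefied, kept, depth = rarefy_to_percentile(toy)
print("depth:", depth, "kept samples:", kept.tolist())
print(rarefied)
```

Every retained sample ends up with exactly `depth` reads, so richness and distance measures are compared at equal sequencing effort. Because the subsampling is random, it is worth fixing the seed (or repeating the draw several times) so results can be reproduced.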
Normalization prior to differential abundance analysis:
Chen et al., 2018 advocate for normalizing data using the geometric mean of pairwise ratios (GMPR). They make the case that GMPR outperforms other methods for normalizing zero-inflated matrices, the type commonly produced during metabarcoding studies. This method should compensate for compositional effects that could lead to false positive identification of taxa with differential abundances. They compared their method to other commonly used normalization methods, including cumulative sum scaling (CSS), relative log expression (RLE/RLE+ available in DESeq2), trimmed mean of M values (TMM/TMM+ available in edgeR), and total sum scaling (TSS). GMPR is available as an R package (GMPR_0.1.3.tar.gz).
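To give a feel for what "geometric mean of pairwise ratios" means, here is a rough sketch in Python with NumPy. It reflects my reading of the method: for each pair of samples, take the median count ratio over the taxa that are non-zero in both, then take the geometric mean of those medians across samples to get one size factor per sample. It is not the authors' R package, which may apply additional thresholds; the function name `gmpr_size_factors` and the `min_shared` parameter are hypothetical.

```python
import numpy as np

def gmpr_size_factors(counts, min_shared=1):
    """Sketch of GMPR-style size factors for a samples x taxa count matrix.

    Hypothetical helper for illustration only. Returns one size factor per
    sample, or NaN where no pairwise ratio could be computed.
    """
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[0]
    factors = np.full(n_samples, np.nan)
    for i in range(n_samples):
        log_ratios = []
        for j in range(n_samples):
            # Only taxa observed in both samples contribute, which sidesteps zeros.
            shared = (counts[i] > 0) & (counts[j] > 0)
            if shared.sum() < min_shared:
                continue
            # Median ratio of sample i to sample j over the shared taxa.
            r_ij = np.median(counts[i, shared] / counts[j, shared])
            log_ratios.append(np.log(r_ij))
        if log_ratios:
            # Geometric mean of the pairwise median ratios.
            factors[i] = np.exp(np.mean(log_ratios))
    return factors

# Toy data: the second sample is the first one sequenced twice as deeply,
# with one taxon dropping out to zero.
toy = np.array([
    [1, 2, 0, 4],
    [2, 0, 0, 8],
])
size_factors = gmpr_size_factors(toy)
normalized = toy / size_factors[:, None]
print(size_factors)
print(normalized)
```

Dividing each sample's counts by its size factor puts the samples on a common scale. In the toy data the second sample gets a size factor twice that of the first, even though one of its taxa is zero, which is the behaviour you want from a method aimed at zero-inflated tables.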
Here are some papers on the topic:
Chen et al., 2018 - GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data
Weiss et al., 2017 - Normalization and microbial differential abundance strategies depend upon data characteristics
Gloor et al., 2017 - Microbiome Datasets Are Compositional: And This Is Not Optional
McMurdie and Holmes, 2014 - Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible
Gihring et al., 2011 - Massively parallel rRNA gene sequencing exacerbates the potential for biased community diversity comparisons due to variable library sizes