Estimating selection coefficients with high precision

Gallet et al. (2012) Genetics

Mutations of small effect can play an important role in evolution, but they are difficult to measure experimentally because the precision with which fitness effects can be measured is relatively low. Such a low precision in fitness measures limits the ability to determine whether the fitness effect of a mutation varies across different environmental or genetic contexts and adds to other sources of stochasticity (Lenormand et al. 2009) to make it difficult to reliably predict evolutionary trajectories.

The selection coefficient (s) of a mutation can be defined from the expected change in its relative frequency (compared to a non-mutated ancestor) over one generation (e.g., Rousset 2004). Precisely measuring selection coefficients poses technical, conceptual, and statistical challenges. The technical challenge is to set up a technique that allows experiments to be carried out efficiently. First, it is important to study large populations in order to minimize the effect of genetic drift relative to selection at the intra-population level. Second, because of drift, replication is fundamentally necessary to estimate fitness, and the precision of a given fitness measure must account for the inter-replicate variance. Third, the experiment should not last too long to avoid the complications related to the selection of newly arising mutations. Fourth, inferring allelic selection coefficients against a common reference strain requires that genotypic fitness is transitive. These potential complications require adding proper controls to competition experiments. Fifth, a further complication is that fitness may vary because of changing environmental conditions. If selection varies, and it probably always does to some extent (Bell 2008; Bell 2010), measuring selection requires measuring both a mean and a variance (the latter not including sampling error). In summary, measuring selection with precision requires estimating an expectation over several replicates, so that its variance can be decomposed into components due to sampling error, drift, and variable selection.
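The variance decomposition mentioned above can be illustrated with a simple method-of-moments sketch. This is not the likelihood framework with random effects used in the study; it is a minimal illustration, assuming each replicate yields an estimate of s together with a known sampling variance. The function name and all numbers are hypothetical. The sketch also cannot separate drift from variable selection: both are lumped into a single "excess" term.

```python
from statistics import mean, variance

def excess_variance(s_estimates, sampling_variances):
    """Among-replicate variance left after removing the mean sampling variance.

    Clamped at zero because a moment estimate can come out negative by chance.
    """
    observed = variance(s_estimates)      # unbiased sample variance (n - 1)
    expected = mean(sampling_variances)   # average within-replicate error variance
    return max(0.0, observed - expected)

# Hypothetical replicate estimates of s and their sampling variances
# (illustrative values only, not data from the study).
s_hat = [0.010, 0.012, 0.008, 0.011, 0.009]
samp_var = [1e-6] * 5
extra = excess_variance(s_hat, samp_var)
```

An excess clearly above zero would suggest that something beyond sampling error (drift, variable selection, or both) contributes to the spread among replicates.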

From a statistical point of view, when selection can be approximated by a continuous process through time in an isolated population, a simple approach is to regress log(p/q) (where p and q represent the frequencies of the two competitors) on time expressed in units of generations (Fisher 1930). The connection with logistic regression and generalized linear models is then straightforward (Arnason and Barker 1999), and these are more appropriate than least squares. However, complications arise in the analysis of time series because errors are correlated across repeated measurements through time (Arnason and Barker 1999; O’Hara 2005), especially when both drift and fluctuating selection cause frequency variation. Mixed models offer an attractive way to circumvent this problem and to measure selection and its variation.
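As a minimal sketch of the regression idea, the slope of log(p/q) against time in generations gives an estimate of s. The example below uses ordinary least squares purely for simplicity, which the text above notes is less appropriate than logistic regression, and it ignores correlated errors through time. The function names and the frequencies are hypothetical and are generated from a deterministic model, not taken from the study.

```python
import math

def logit(p):
    """Log ratio of competitor frequencies, log(p/q) with q = 1 - p."""
    return math.log(p / (1.0 - p))

def estimate_s(generations, freqs):
    """Least-squares slope of log(p/q) against time in generations."""
    y = [logit(p) for p in freqs]
    n = len(generations)
    t_mean = sum(generations) / n
    y_mean = sum(y) / n
    sxy = sum((t - t_mean) * (yi - y_mean) for t, yi in zip(generations, y))
    sxx = sum((t - t_mean) ** 2 for t in generations)
    return sxy / sxx

def expected_freq(p0, s, t):
    """Deterministic frequency after t generations under constant selection s."""
    ratio = (p0 / (1.0 - p0)) * math.exp(s * t)
    return ratio / (1.0 + ratio)

# Hypothetical example: equal starting frequencies, s = 0.01 per generation.
gens = [0, 2, 4, 6, 8]
freqs = [expected_freq(0.5, 0.01, t) for t in gens]
s_hat = estimate_s(gens, freqs)
```

With noise-free frequencies the slope recovers the value of s used to generate them; with real counts, the scatter around the line reflects sampling error, drift, and any variation in selection.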

In this project, we developed an approach combining several features to improve and quantify the precision of fitness measures. First, we use techniques that have proved to be among the most efficient for measuring fitness: competition assays between large populations of Escherichia coli strains to minimize drift, and engineered mutations (introduction of the CFP or YFP markers) to avoid the problem of indirect selection. We use two fluorescent markers (Rosenfeld et al. 2005) combined with flow cytometry (Lunzer et al. 2002) to measure frequency variation with great precision and thus minimize sampling error.

Description of the method used to insert CFP and YFP markers at the RhaA locus in the E. coli genome.

Key aspects of our approach are as follows: A comprehensive set of four competition assays enables us to separately estimate mutational selection coefficients (α), the cost of the marker (β), epistasis between mutation and marker (γ), and transitivity (τ). We use short-term batch culture to facilitate massive replication and to reduce the possibility that de novo beneficial mutations will occur. We analyze the data in an integrated likelihood framework with random effects to partition sources of variation in our estimates (sampling error vs. drift vs. variable selection).

Panel a shows the distribution of E. coli cells marked with either the CFP or the YFP genomic marker. Panel b shows the selection coefficient estimates (10 measures per experiment × 4 experiments = 40 replicates) for the wild type and the three mini-Tn10 E. coli mutants, T63, T103, and T121, in the four competition experiments (a: wc vs. wy; b: mc vs. my; c: my vs. wc; d: mc vs. wy, where wc = wild-type CFP, wy = wild-type YFP, mc = mutant CFP, and my = mutant YFP).

Our approach allowed us to estimate both the mean and the variance of selection coefficients at a precision of 0.02%. This precision allowed us to detect, for some mutations, variation among selection coefficient measures that was significantly larger than expected from drift alone, indicating that some kind of cryptic variation was acting during our competitions. This finding implies that, in practice, selection coefficients should be treated as following a distribution and that precise measures require evaluating both the mean and the variance of this distribution. Furthermore, the variance in s indicates that some uncontrolled processes occur in these experiments (cryptic environmental or genetic variation), which limits how far the differences seen across replicates can be dissected.