As a data science researcher, I focused on developing statistical procedures and mathematical models for analyzing genetic data sets. In particular, I worked on improving malaria transmission monitoring by developing more reliable methods for estimating the frequency distribution of genetically distinct pathogens (haplotypes) and the distribution of super- and co-infections (multiplicity of infection, MOI) in endemic areas. A further aim was to explain the relationship between the frequency, prevalence, and incidence of drug-resistant pathogens by formulating a formal statistical framework tailored to users.
Multiplicity of infection (MOI) denotes the number of super-infections due to multiple infectious contacts. Accurate MOI estimates from SNP or microsatellite data are crucial for clinical, genetic, and epidemiological insights. Molecular methods have limitations, so incomplete information and undetected alleles are common. Although MOI and allele frequencies are fundamental in malaria genetic studies, current methods inadequately account for unobserved genetic/molecular information, which can bias results. We propose a statistical model that addresses this issue and use the expectation-maximization (EM) algorithm to derive maximum-likelihood estimates (MLEs) of MOI, the allele-frequency spectrum, and prevalences. The method accommodates patient blood samples with entirely missing information and has desirable analytical properties. It is applied to a dataset from Asembo Bay, Kenya, and performs well for realistic sample sizes. An R implementation is provided for estimating the allele-frequency spectrum at a single SNP or microsatellite locus together with MOI.
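To make the quantity being maximized more concrete, the following Python sketch evaluates a likelihood of the kind such a model could use. It is only an illustration under stated assumptions, not the published method: it assumes MOI follows a Poisson distribution conditioned on at least one infection (parameter `lam`), that each infecting lineage is drawn independently according to the frequencies `freqs`, and that only the presence or absence of each lineage is observed. It does not implement the EM algorithm or the handling of missing information, and all function names are hypothetical.

```python
from itertools import combinations
from math import exp, log


def observation_probability(observed, freqs, lam):
    """Probability that exactly the lineages in `observed` are detected.

    Illustrative assumptions: MOI is Poisson(lam) conditioned on MOI >= 1,
    and each infecting lineage is drawn independently with probabilities
    `freqs`. `observed` holds the indices of the detected lineages.
    """
    observed = list(observed)
    total = 0.0
    # Inclusion-exclusion over the non-empty subsets of the observed set.
    for size in range(1, len(observed) + 1):
        sign = (-1) ** (len(observed) - size)
        for subset in combinations(observed, size):
            subset_freq = sum(freqs[i] for i in subset)
            total += sign * (exp(lam * subset_freq) - 1.0)
    return total / (exp(lam) - 1.0)


def log_likelihood(samples, freqs, lam):
    """Log-likelihood of independent samples, each a set of detected lineages."""
    return sum(log(observation_probability(s, freqs, lam)) for s in samples)


def mean_moi(lam):
    """Mean MOI under the assumed conditional Poisson distribution."""
    return lam / (1.0 - exp(-lam))
```

Under these assumptions, the probability of detecting a single lineage i alone reduces to (exp(lam * p_i) - 1) / (exp(lam) - 1), and the probabilities of all possible non-empty observations sum to one. A maximum-likelihood fit would search for the `lam` and `freqs` that maximize `log_likelihood`.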
We propose a bias-corrected ML estimator for MOI and pathogen lineage frequencies using a single molecular marker. Heuristic adjustments compensate for shortcomings of the bias correction. Simulation results show that the bias is successfully removed, especially for extreme parameters, and that the heuristic adjustments improve the correction, particularly for small sample sizes. The bias-corrected estimators are nearly unbiased, and their variances are close to the Cramér-Rao lower bound, suggesting minimal room for improvement without additional information. The estimators are reasonably robust against model violations. Applying the bias corrections improves the quality of MOI estimates in both low and high transmission areas. Further improvements may come from combining data from multiple molecular markers or incorporating stratifying information.
Improved estimates of multiplicity of infection in malaria and related infectious diseases
Genetic measures such as multiplicity of infection (MOI) have emerged as pivotal tools in the effort to combat infectious diseases, offering distinct advantages over traditional metrics like the basic reproduction number. The significance of MOI is increasingly recognized, especially its ability to differentiate between pathogen variants at the genetic/molecular level. However, estimating MOI and pathogen-lineage frequencies from molecular data is challenging, particularly with the small sample sizes common in practice. To address biases in these estimates, researchers have proposed maximum-likelihood (ML) methods coupled with analytical bias correction and heuristic adjustments.
Analytical bias correction is effective in mitigating substantial biases in the estimates, especially when sample sizes are limited. Heuristic adjustments refine the estimates further, yielding estimators that are nearly unbiased, with variances close to the theoretical minimum (the Cramér-Rao lower bound).
A noteworthy approach is to combine data from multiple molecular markers. By averaging the MOI estimates from markers with desirable properties, this heuristic variance reduction method substantially reduces the variance of the MOI estimate. This is particularly relevant when missing values are common in the data.
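The averaging step can be sketched in a few lines of Python. This is a minimal illustration, assuming a plain unweighted mean over markers and `None` as the placeholder for a marker that yielded no estimate; the function name is hypothetical, and the choice of which markers have "desirable properties" is outside the sketch.

```python
from statistics import mean


def combined_moi_estimate(per_marker_estimates):
    """Average per-marker MOI estimates, skipping markers with no estimate.

    `per_marker_estimates` is a list of floats, with None marking a marker
    for which no estimate could be obtained (for example, missing data).
    """
    usable = [x for x in per_marker_estimates if x is not None]
    if not usable:
        raise ValueError("no marker yielded an MOI estimate")
    return mean(usable)
```

If the per-marker estimates were independent with a common variance, the variance of their mean over k markers would be that variance divided by k, which is the intuition behind the variance reduction. Real markers need not satisfy these idealized conditions.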
To determine when heuristic variance reduction is applicable, a simulation study examines various scenarios and identifies the situations in which the approach is most effective. The results contribute to ongoing efforts to combat infectious diseases and underscore the value of incorporating diverse data sources to obtain accurate and reliable estimates.
The R package MLMOI was originally designed to derive maximum-likelihood estimates of prevalence, frequencies, and multiplicity of infection (MOI) from molecular data of infectious diseases such as malaria. These are clinically, genetically, and epidemiologically important quantities in epidemiological and evolutionary-genetic studies. A significant obstacle in analyzing such data is the heterogeneous formats in which they are stored. A further complication is data entry errors and inconsistent entries, which occur frequently in practice, are difficult to detect, and can lead to wrong analyses. As a solution, MLMOI offers a flexible import function that handles combinations of heterogeneous data formats and automatically detects and reports several potential data entry errors. The import function is not restricted to molecular data of infectious diseases similar to malaria. It can also be used to import, clean, and transform other types of data, including data from outside genetic/molecular applications.