Nickols William A, Schwabl Philipp, Niangaly Amadou, Murphy Sean C, Crompton Peter D, Neafsey Daniel E
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
bioRxiv. 2025 Feb 8:2025.02.06.636982. doi: 10.1101/2025.02.06.636982.
Longitudinal pathogen genotyping data from individual hosts can uncover strain-specific infection dynamics and their relationships to disease and intervention, especially in the malaria field. An important use case involves distinguishing newly incident from pre-existing (persistent) strains, but implementation faces statistical challenges relating to individual samples containing multiple strains, strains sharing alleles, and markers dropping out stochastically during the genotyping process. Current approaches to distinguish new versus persistent strains therefore rely primarily on simple rules that consider only the time since alleles were last observed.
We developed DINEMITES (Distinguishing New Malaria Infections in Time Series), a set of statistical methods to estimate, from longitudinal genotyping data, the probability that each sequenced allele represents a new infection harboring that allele, the total molecular force of infection (molFOI, the cumulative number of newly acquired strains over time) for each individual, and the total number of new infection events for each individual. DINEMITES can handle time points with missing sequencing data, incorporate treatment history and covariates affecting the rate of new or persistent infections, and scale to studies with thousands of samples sequenced across multiple loci containing hundreds of possible alleles. In synthetic evaluations, the DINEMITES Bayesian model, which generally outperformed an alternative clustering-based model also developed in this work, accurately estimated key clinical parameters such as molFOI (bias 2.5, compared to -12.2 for a typical simple rule). When applied to three real longitudinal genotyping datasets, the model detected 33%, 112%, and 359% more average infections per participant than would have been detected by applying a typical simple rule to the equivalent datasets without sequencing.
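For orientation, the kind of "simple rule" baseline referred to above can be sketched as follows. This is a minimal illustration, not the DINEMITES method and not any specific published rule: the function names, the 30-day window, the single-locus input format, and the choice to count an allele's first observation as new are all assumptions made here for the example.

```python
def simple_rule_calls(observations, window_days=30):
    """Label each observed allele as new or persistent with a time-since-last-seen rule.

    observations: list of (day, set_of_alleles) for one individual at one locus.
    window_days: hypothetical threshold; an allele unseen for longer than this is called new.
    Returns a list of (day, allele, is_new) tuples in chronological order.
    """
    last_seen = {}  # allele -> most recent day on which it was observed
    calls = []
    for day, alleles in sorted(observations, key=lambda obs: obs[0]):
        for allele in sorted(alleles):
            previous_day = last_seen.get(allele)
            # Assumption: a first-ever observation counts as a new infection.
            is_new = previous_day is None or (day - previous_day) > window_days
            calls.append((day, allele, is_new))
            last_seen[allele] = day
    return calls


def count_new_alleles(calls):
    """Crude single-locus proxy for molFOI: the number of alleles called new."""
    return sum(1 for _, _, is_new in calls if is_new)


# Hypothetical visits for one participant: (study day, alleles detected).
visits = [(0, {"A", "B"}), (14, {"A"}), (28, set()), (70, {"A", "C"})]
calls = simple_rule_calls(visits, window_days=30)
print(calls)
print(count_new_alleles(calls))
```

A rule of this form ignores the issues the abstract highlights: an allele that drops out stochastically and later reappears after the window is counted as a new infection, and a new strain sharing an allele with a persistent one is missed entirely.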