Kim Younhun, Worby Colin J, Acharya Sawal, van Dijk Lucas R, Alfonsetti Daniel, Gromko Zackary, Azimzadeh Philippe, Dodson Karen, Gerber Georg, Hultgren Scott, Earl Ashlee M, Berger Bonnie, Gibson Travis E
bioRxiv. 2024 Jul 23:2023.01.25.525531. doi: 10.1101/2023.01.25.525531.
The ability to detect and quantify microbiota over time has a plethora of clinical, basic science, and public health applications. One of the primary means of tracking microbiota is through sequencing technologies. When the microorganism of interest is well characterized or known, targeted sequencing is often used. In many applications, however, untargeted bulk (shotgun) sequencing is more appropriate; for instance, the tracking of infection transmission events and nucleotide variants across multiple genomic loci, or studying the role of multiple genes in a particular phenotype. Given these applications, and the observation that pathogens and other taxa of interest can reside at low relative abundance in the gastrointestinal tract, there is a critical need for algorithms that accurately track low-abundance taxa with strain-level resolution. Here we present a sequence quality- and time-aware model, ChronoStrain, that introduces uncertainty quantification to gauge low-abundance species and significantly outperforms the current state-of-the-art on both real and synthetic data. ChronoStrain leverages sequences' quality scores and the samples' temporal information to produce a probability distribution over abundance trajectories for each strain tracked in the model. We demonstrate ChronoStrain's improved performance in capturing post-antibiotic strain blooms among women with recurrent urinary tract infections (UTIs) from the UTI Microbiome (UMB) Project. Other strain tracking models on the same data either show inconsistent temporal colonization or can only track consistently using very coarse groupings. In contrast, our probabilistic outputs can reveal the relationship between low-confidence strains present in the sample that cannot be reliably assigned a single reference label (either due to poor coverage or novelty) while simultaneously calling high-confidence strains that can be unambiguously assigned a label.
We also analyze samples from the Early Life Microbiota Colonisation (ELMC) Study, demonstrating the algorithm's ability to correctly identify strains using paired sample isolates as validation.