Department of Computer Science and Engineering, Pennsylvania State University, State College, PA 16802, United States.
Department of Biology, Pennsylvania State University, State College, PA 16802, United States.
Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae047.
In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is determining which genomes from a reference database are present in or absent from a given sample metagenome. Existing tools generally return point estimates with no associated confidence or measure of uncertainty. As a result, practitioners have difficulty interpreting the output of these tools, particularly for low-abundance organisms, which often reside in the "noisy tail" of incorrect predictions. Furthermore, few tools account for the fact that reference databases are often incomplete and rarely, if ever, contain exact replicas of genomes present in an environmentally derived metagenome.
We address these issues by introducing the algorithm YACHT: Yes/No Answers to Community membership via Hypothesis Testing. This approach introduces a statistical framework that accounts for sequence divergence between the reference and sample genomes, measured by average nucleotide identity (ANI), as well as for incomplete sequencing depth, and thereby provides a hypothesis test for the presence or absence of a reference genome in a sample. After introducing our approach, we quantify its statistical power and how that power changes with varying parameters. We then perform extensive experiments on both simulated and real data to confirm the accuracy and scalability of the approach.
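To make the idea of a presence/absence hypothesis test concrete, the following is a minimal illustrative sketch and not the YACHT implementation. It assumes a simple model that the abstract does not spell out: each of a reference genome's n distinct k-mers is observed in the sample independently with probability coverage × ANI^k (a k-mer survives only if none of its k bases has mutated), so the observed count is binomial. The function and parameter names (`presence_threshold`, `call_presence`, `min_coverage`, `alpha`) are hypothetical and chosen for illustration only.

```python
import math


def binom_pmf(n: int, k: int, p: float) -> float:
    """Binomial probability P(X = k), computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def presence_threshold(n_kmers: int, ani: float, ksize: int,
                       min_coverage: float, alpha: float) -> int:
    """Return the largest t with P(X < t) <= alpha under the null hypothesis.

    Null hypothesis (assumed for this sketch): the genome is present with
    at least the given ANI and coverage, so X ~ Binomial(n_kmers, p) with
    p = min_coverage * ani**ksize. Declaring "absent" when X < t then has
    a false negative rate of at most alpha.
    """
    p = min_coverage * ani ** ksize
    cumulative = 0.0
    for t in range(n_kmers + 1):
        cumulative += binom_pmf(n_kmers, t, p)  # now equals P(X < t + 1)
        if cumulative > alpha:
            return t
    return n_kmers


def call_presence(observed: int, n_kmers: int, ani: float, ksize: int,
                  min_coverage: float, alpha: float) -> bool:
    """Yes/no call: True if the observed shared k-mer count meets the threshold."""
    threshold = presence_threshold(n_kmers, ani, ksize, min_coverage, alpha)
    return observed >= threshold


if __name__ == "__main__":
    # Hypothetical numbers: 1000 reference k-mers, 210 of them seen in the sample.
    threshold = presence_threshold(1000, ani=0.95, ksize=31,
                                   min_coverage=1.0, alpha=0.01)
    print("threshold:", threshold)
    print("present:", call_presence(210, 1000, 0.95, 31, 1.0, 0.01))
```

In this sketch, lowering `min_coverage` or `ani` lowers the threshold, which mirrors the abstract's point that the test tolerates both incomplete sequencing depth and divergence from the reference. The independence assumption is a simplification, since overlapping k-mers in a real genome are correlated.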
The source code implementing this approach is available via Conda and at https://github.com/KoslickiLab/YACHT. We also provide the code for reproducing experiments at https://github.com/KoslickiLab/YACHT-reproducibles.