Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel.
Faculty of Medical & Health Sciences, Tel Aviv University, Tel Aviv, Israel.
mBio. 2024 Sep 11;15(9):e0115024. doi: 10.1128/mbio.01150-24. Epub 2024 Aug 20.
The human gut microbiome significantly impacts health, prompting a rise in longitudinal studies that capture microbiome samples at multiple time points. Such studies allow researchers to characterize microbiome changes over time but, importantly, also present major analytical challenges due to incomplete or irregular sampling. To address this challenge, longitudinal microbiome studies often employ various interpolation methods, aiming to infer missing microbiome data. However, to date, a comprehensive assessment of such microbiome interpolation techniques, as well as best-practice guidelines for interpolating microbiome data, is still lacking. This work aims to fill this gap by rigorously implementing and systematically evaluating a large array of interpolation methods, spanning several different categories, for longitudinal microbiome interpolation. To assess each method and its ability to accurately infer microbiome composition at missing time points, we used a leave-one-out approach on three longitudinal microbiome data sets that follow individuals over a long period of time. Overall, our analysis demonstrated that the K-nearest neighbors algorithm consistently outperforms other methods in interpolation accuracy, yet accuracy varied widely across data sets, individuals, and time. Factors such as microbiome stability, sample size, and the time gap between interpolated and adjacent samples significantly influenced accuracy, allowing us to develop a model for predicting the expected interpolation accuracy at a missing time point. Taken together, our findings suggest that accurate interpolation in longitudinal microbiome data is feasible, especially in dense cohorts. Furthermore, using our predictive model, future studies can interpolate data only at time points where the expected interpolation accuracy is high.
Since missing samples are common in longitudinal microbiome data sets due to inconsistent collection practices, it is important to evaluate and benchmark different interpolation methods for predicting microbiome composition in such samples and for facilitating downstream analysis. Our study rigorously evaluated several such methods and identified the K-nearest neighbors approach as particularly effective for this task. The study also notes significant variability in interpolation accuracy among individuals, influenced by factors such as age, sample size, and sampling frequency. Furthermore, we developed a predictive model for estimating interpolation accuracy at a specific time point, enhancing the reliability of such analyses in future studies. Taken together, our study thus provides critical insights and tools that enhance the accuracy and reliability of data interpolation methods in the growing field of longitudinal microbiome research.
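The leave-one-out evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how neighbors are chosen or how accuracy is scored, so the sketch assumes that neighbors are the k temporally closest samples from the same individual, that they are combined by inverse-time-distance weighting, and that error is measured as Bray-Curtis dissimilarity. The function names (`knn_interpolate`, `bray_curtis`, `leave_one_out_errors`) and the toy data are hypothetical.

```python
import numpy as np


def knn_interpolate(times, abundances, t_query, k=2):
    """Estimate composition at t_query from the k temporally nearest samples.

    Assumption: neighbors are weighted by inverse time distance and the
    result is renormalized to sum to 1 (relative abundances).
    """
    times = np.asarray(times, dtype=float)
    abundances = np.asarray(abundances, dtype=float)
    dist = np.abs(times - t_query)
    nearest = np.argsort(dist)[:k]            # indices of the k closest samples
    weights = 1.0 / (dist[nearest] + 1e-9)    # small constant avoids division by zero
    estimate = weights @ abundances[nearest] / weights.sum()
    return estimate / estimate.sum()


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.abs(p - q).sum() / (p + q).sum()


def leave_one_out_errors(times, abundances, k=2):
    """Hold out each sample in turn, interpolate it, and score the estimate."""
    times = np.asarray(times, dtype=float)
    abundances = np.asarray(abundances, dtype=float)
    errors = []
    for i in range(len(times)):
        keep = np.arange(len(times)) != i     # mask that drops the held-out sample
        estimate = knn_interpolate(times[keep], abundances[keep], times[i], k)
        errors.append(bray_curtis(estimate, abundances[i]))
    return np.array(errors)


# Toy data (hypothetical): five samples from one individual, three taxa,
# with an irregular gap between day 3 and day 5.
toy_times = [0, 1, 2, 3, 5]
toy_abundances = [
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.50, 0.35, 0.15],
    [0.50, 0.30, 0.20],
    [0.40, 0.35, 0.25],
]
toy_errors = leave_one_out_errors(toy_times, toy_abundances, k=2)
print(toy_errors)
```

Each entry of `toy_errors` is the dissimilarity between a held-out sample and its interpolated estimate, so lower values mean more accurate interpolation. Comparing these per-sample errors against the time gap to the nearest remaining samples is one simple way to explore the dependence on sampling density that the abstract reports.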