Fouquier Jennifer, Stanislawski Maggie, O'Connor John, Scadden Ashley, Lozupone Catherine
Department of Biomedical Informatics, School of Medicine, University of Colorado, Anschutz Medical Campus, Aurora, CO.
bioRxiv. 2024 Aug 15:2024.03.20.585968. doi: 10.1101/2024.03.20.585968.
Longitudinal microbiome studies (LMS) are increasingly common but pose analytic challenges: repeated measurements are non-independent and require mixed-effects models, and the large amounts of data motivate exploratory analysis to identify factors related to outcome variables. Although change analysis (i.e., calculating deltas between values at different timepoints) can be powerful, how best to conduct these analyses is not always clear. For example, measurements in observational LMS fluctuate naturally, so baseline might not be the reference of primary interest, whereas in interventional LMS baseline is a key reference point, often marking the start of treatment.
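To make the choice of reference point concrete, the following minimal sketch computes per-subject deltas against either the first timepoint or the immediately preceding timepoint. It is an illustration only and not EXPLANA's implementation; the function name `compute_deltas`, the `reference` options `"baseline"` and `"previous"`, and the record layout are all assumptions introduced here.

```python
from collections import defaultdict


def compute_deltas(records, reference="baseline"):
    """Compute per-subject changes between timepoints.

    records: iterable of (subject, timepoint, value) tuples (hypothetical layout).
    reference: "baseline" compares each later timepoint to the subject's first
               timepoint; "previous" compares it to the timepoint just before it.
    Returns a list of (subject, reference_timepoint, timepoint, delta) tuples.
    """
    if reference not in ("baseline", "previous"):
        raise ValueError("reference must be 'baseline' or 'previous'")

    # Group measurements by subject so deltas never cross subjects.
    by_subject = defaultdict(list)
    for subject, timepoint, value in records:
        by_subject[subject].append((timepoint, value))

    deltas = []
    for subject, series in by_subject.items():
        series.sort()  # order each subject's measurements by timepoint
        for i in range(1, len(series)):
            ref_idx = 0 if reference == "baseline" else i - 1
            ref_time, ref_value = series[ref_idx]
            time, value = series[i]
            deltas.append((subject, ref_time, time, value - ref_value))
    return deltas


# Toy data: one subject measured at three timepoints.
records = [("s1", 0, 1.0), ("s1", 1, 3.0), ("s1", 2, 2.0)]
baseline_deltas = compute_deltas(records, reference="baseline")
previous_deltas = compute_deltas(records, reference="previous")
```

On the toy data, the baseline reference gives changes of 2.0 and 1.0 relative to timepoint 0, while the previous-timepoint reference gives 2.0 and then -1.0, so the same measurements can tell different stories depending on the reference chosen.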
To address these challenges, we developed EXPLANA (EXPLoratory ANAlysis), a feature selection workflow for cross-sectional studies and LMS that supports numerical and categorical data. Machine-learning methods were combined with different types of change calculations and downstream interpretation methods to identify statistically meaningful variables and explain their relationship to outcomes. EXPLANA generates an interactive report that summarizes methods and results in text and graphics. On simulated data, EXPLANA performed well, with an average area under the curve (AUC) of 0.91 (range: 0.79-1.0, SD = 0.05), outperformed an existing tool (AUC: 0.95 vs. 0.56), and identified novel order-dependent categorical feature changes. EXPLANA is broadly applicable and simplifies analytics for identifying features related to outcomes of interest.
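As a rough illustration of what an order-dependent categorical change could look like, the sketch below encodes each pair of consecutive categories as a directed transition label, so that a change from "low" to "high" is distinct from a change from "high" to "low". This is an assumed encoding for illustration, not the encoding EXPLANA uses; the function name `categorical_changes` and the `"a->b"` label format are hypothetical.

```python
def categorical_changes(series):
    """Encode consecutive categorical values as directed transition labels.

    series: list of category labels for one subject, ordered by timepoint.
    Returns one label per consecutive pair, e.g. "low->high".
    The label format is an assumption made for this sketch.
    """
    return [f"{before}->{after}" for before, after in zip(series, series[1:])]


# Toy data: the same two categories visited in opposite orders.
up = categorical_changes(["low", "high", "high"])
down = categorical_changes(["high", "low"])
```

Because direction is preserved in the label, `up` contains "low->high" while `down` contains "high->low", and a downstream model can treat the two as different features.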