Eissa Tarek, Huber Marinus, Obermayer-Pietsch Barbara, Linkohr Birgit, Peters Annette, Fleischmann Frank, Žigman Mihaela
Chair of Experimental Physics - Laser Physics, Ludwig-Maximilians-Universität München, Bavaria 85748, Germany.
Laboratory for Attosecond Physics, Max Planck Institute of Quantum Optics, Bavaria 85748, Germany.
PNAS Nexus. 2024 Oct 15;3(10):pgae449. doi: 10.1093/pnasnexus/pgae449. eCollection 2024 Oct.
Molecular analytics increasingly utilize machine learning (ML) for predictive modeling based on data acquired through molecular profiling technologies. However, developing robust models that accurately capture physiological phenotypes is challenged by the dynamics inherent to biological systems, variability stemming from analytical procedures, and the resource-intensive nature of obtaining sufficiently representative datasets. Here, we propose and evaluate a new method: Contextual Out-of-Distribution Integration (CODI). Based on experimental observations, CODI generates synthetic data that integrate unrepresented sources of variation encountered in real-world applications into a given molecular fingerprint dataset. By augmenting a dataset with out-of-distribution variance, CODI enables an ML model to better generalize to samples beyond the seed training data, reducing the need for extensive experimental data collection. Using three independent longitudinal clinical studies and a case-control study, we demonstrate CODI's application to several classification tasks involving vibrational spectroscopy of human blood. We showcase our approach's ability to enable personalized fingerprinting for multiyear longitudinal molecular monitoring and enhance the robustness of trained ML models for improved disease detection. Our comparative analyses reveal that incorporating CODI into the classification workflow consistently leads to increased robustness against data variability and improved predictive accuracy.
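The augmentation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' CODI implementation, whose details are not given here. It assumes that the unrepresented variation is available as a set of difference vectors (for example, differences between repeated measurements of the same sample) and simply adds randomly scaled copies of those vectors to the seed spectra. The function name `augment_with_variation`, its parameters, and the random demo data are all hypothetical.

```python
import numpy as np


def augment_with_variation(seed_X, seed_y, variation, n_copies=5, scale=1.0, rng=None):
    """Return the seed data plus synthetic copies perturbed by sampled variation.

    Illustrative assumption only: each synthetic sample is a seed sample plus
    one observed variation vector multiplied by a random Gaussian coefficient.
    """
    rng = np.random.default_rng(rng)
    n, d = seed_X.shape
    # Choose one variation vector per synthetic sample.
    idx = rng.integers(0, variation.shape[0], size=(n_copies, n))
    # Random sign and magnitude, so the added variance is symmetric around the seed.
    coeff = rng.normal(0.0, scale, size=(n_copies, n, 1))
    synthetic = seed_X[None, :, :] + coeff * variation[idx]
    X_aug = np.concatenate([seed_X, synthetic.reshape(-1, d)], axis=0)
    # Synthetic samples inherit the label of the seed sample they came from.
    y_aug = np.concatenate([seed_y, np.tile(seed_y, n_copies)], axis=0)
    return X_aug, y_aug


# Hypothetical demo data: 20 seed "spectra" with 50 features and binary labels,
# plus 8 variation vectors standing in for experimentally observed differences.
demo_rng = np.random.default_rng(0)
seed_X = demo_rng.normal(size=(20, 50))
seed_y = demo_rng.integers(0, 2, size=20)
variation = 0.1 * demo_rng.normal(size=(8, 50))

X_aug, y_aug = augment_with_variation(seed_X, seed_y, variation, n_copies=3, rng=1)
```

The augmented arrays can then be passed to any classifier in place of the seed data. Whether this helps depends on how well the supplied variation vectors reflect the variability met in deployment.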