Optimal transport for automatic alignment of untargeted metabolomic data.

Affiliations

Nutrition and Metabolism Branch, International Agency for Research on Cancer, Lyon, France.

Massachusetts Institute of Technology, Department of Mathematics, Boston, United States.

Publication information

eLife. 2024 Jun 18;12:RP91597. doi: 10.7554/eLife.91597.

DOI:10.7554/eLife.91597

PMID:38896449

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11186628/

Abstract

Untargeted metabolomic profiling through liquid chromatography-mass spectrometry (LC-MS) measures a vast array of metabolites within biospecimens, advancing drug development, disease diagnosis, and risk prediction. However, the low throughput of LC-MS poses a major challenge for biomarker discovery, annotation, and experimental comparison, necessitating the merging of multiple datasets. Current data pooling methods face practical limitations because they are sensitive to data variations and depend on hyperparameter choices. Here, we introduce GromovMatcher, a flexible and user-friendly algorithm that automatically combines LC-MS datasets using optimal transport. By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness compared to existing approaches. The algorithm scales to thousands of features and requires minimal hyperparameter tuning. Because manually curated datasets for validating alignment algorithms are scarce in untargeted metabolomics, we develop a dataset split procedure that generates pairs of validation datasets to test the alignments produced by GromovMatcher and other methods. Applying our method to experimental patient studies of liver and pancreatic cancer, we discover shared metabolic features related to patient alcohol intake, demonstrating how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.
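
The abstract does not spell out the dataset split procedure, so the following is only a minimal sketch of the general idea under stated assumptions: one study is divided into two synthetic "studies" with disjoint samples and partially overlapping feature sets, so that the shared features serve as a known ground truth for scoring an alignment. The function names `split_dataset` and `score_alignment`, the `overlap` parameter, and the precision/recall scoring are hypothetical illustrations, not the authors' implementation.

```python
import random


def split_dataset(n_samples, feature_ids, overlap=0.5, seed=0):
    """Split one study into two synthetic studies with known shared features.

    Assumption: samples are split into two disjoint halves, and a fraction
    `overlap` of the features is kept in both halves. The remaining features
    are divided between the two halves so each has some unshared features.
    """
    rng = random.Random(seed)

    # Disjoint sample subsets.
    samples = list(range(n_samples))
    rng.shuffle(samples)
    half = n_samples // 2
    samples1, samples2 = sorted(samples[:half]), sorted(samples[half:])

    # Partially overlapping feature subsets.
    features = list(feature_ids)
    rng.shuffle(features)
    n_shared = int(round(overlap * len(features)))
    shared, rest = features[:n_shared], features[n_shared:]
    mid = len(rest) // 2
    features1 = sorted(shared + rest[:mid])
    features2 = sorted(shared + rest[mid:])
    return (samples1, features1), (samples2, features2), sorted(shared)


def score_alignment(predicted_pairs, shared):
    """Precision and recall of predicted (feature1, feature2) matches.

    The ground truth is that each shared feature matches itself.
    """
    truth = {(f, f) for f in shared}
    predicted = set(predicted_pairs)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    (s1, f1), (s2, f2), shared = split_dataset(100, range(20), overlap=0.5)
    # A perfect matcher recovers exactly the shared features.
    perfect = [(f, f) for f in shared]
    print(score_alignment(perfect, shared))
```

In a realistic validation, the two halves would presumably also be perturbed (for example, by shifting mass-to-charge ratios and retention times) before being handed to an alignment method; that step is omitted here because the abstract gives no details about it.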
