Santana Everton, Ibrahimi Eliana, Ntalianis Evangelos, Cauwenberghs Nicholas, Kuznetsova Tatiana
Research Unit Hypertension and Cardiovascular Epidemiology, KU Leuven Department of Cardiovascular Sciences, University of Leuven, 3000 Leuven, Belgium.
Department of Biology, University of Tirana, 1001 Tirana, Albania.
Int J Mol Sci. 2024 Nov 30;25(23):12905. doi: 10.3390/ijms252312905.
Metabolomic data often present challenges due to high dimensionality, collinearity, and variability in metabolite concentrations. Applying machine learning (ML) to metabolomic analyses enables the extraction of meaningful information from complex data. Combining domain-specific knowledge from metabolomics with explainable ML methods can improve the predictive performance and interpretability of models used in atherosclerosis research. In this work, we aimed to identify the most impactful metabolites associated with the presence of atherosclerotic cardiovascular disease (ASCVD) in cross-sectional case-control studies using explainable ML methods integrated with metabolomics domain knowledge. To this end, a subset of the FLEMENGHO cohort with available metabolomic data served as the training cohort, including 63 patients with a history of ASCVD and 52 non-smoking controls from the same population, matched by age, sex, and body mass index. First, Partial Least Squares Discriminant Analysis (PLS-DA) was applied for dimensionality reduction. Correlations among the selected metabolites were then analyzed in light of their chemical categorization. Next, eXtreme Gradient Boosting (XGBoost) was used to identify metabolites that characterize ASCVD. Finally, the selected metabolites were evaluated in an external cohort to determine how well they distinguish cases from controls. A total of 56 metabolites were selected for ASCVD discrimination using PLS-DA. The main superclasses of the identified metabolites included lipids, organic acids, and organic oxygen compounds. With these metabolites as inputs to the XGBoost model, classification yielded a test area under the curve (AUC) of 0.75. SHapley Additive exPlanations (SHAP) analyses ranked cholesterol, 3-methylhistidine, and glucuronic acid among the most impactful features and showed the diversity of metabolites considered for building the ASCVD discriminator. Also using XGBoost, the selected metabolites achieved an AUC of 0.93 in an independent external validation cohort. In conclusion, the combination of different metabolites has the potential to build classifiers for ASCVD. Integrating metabolite categorization within the SHAP analysis further enhanced the interpretability of the model, offering insights into metabolite-specific contributions to ASCVD risk.
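For readers less familiar with the AUC values reported above, the following is a minimal sketch of how an AUC can be computed from classifier scores. It uses the rank-based (Mann-Whitney) interpretation: the AUC is the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations; they are not the study's data, and this is not necessarily how the authors computed their metrics.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: fraction of (case, control) pairs in which the case
    has the higher score; tied pairs count as 0.5."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy example (NOT data from the study):
# 1 = ASCVD case, 0 = control; scores are predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.3, 0.2, 0.5]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")  # 13 of 16 case-control pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls, which gives a scale for reading the reported 0.75 (test) and 0.93 (external validation) values.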