Topological embedding and directional feature importance in ensemble classifiers for multi-class classification.

Authors

Rocha Liedl Eloisa, Yassin Shabeer Mohamed, Kasapi Melpomeni, Posma Joram M

Affiliations

Section of Bioinformatics, Department of Metabolism, Digestion and Reproduction, Faculty of Medicine, Hammersmith Hospital Campus, Imperial College London, London, W12 0NN, United Kingdom.

Department of Surgery and Cancer, Faculty of Medicine, Hammersmith Hospital Campus, Imperial College London, London, W12 0NN, United Kingdom.

Publication information

Comput Struct Biotechnol J. 2024 Nov 13;23:4108-4123. doi: 10.1016/j.csbj.2024.11.013. eCollection 2024 Dec.

Abstract

Cancer is the second leading cause of disease-related death worldwide, and machine learning-based identification of novel biomarkers is crucial for improving early detection and treatment of various cancers. A key challenge in applying machine learning to high-dimensional data is deriving important features in an interpretable manner to provide meaningful insights into the underlying biological mechanisms. We developed a class-based directional feature importance (CLIFI) metric for decision tree methods and demonstrated its use for The Cancer Genome Atlas proteomics data. The CLIFI metric was incorporated into four algorithms: Random Forest (RF), LAtent VAriable Stochastic Ensemble of Trees (LAVASET), Gradient Boosted Decision Trees (GBDTs), and a new extension incorporating the LAVA step into GBDTs (LAVABOOST). Both LAVA methods incorporate topological information from protein interactions into the decision function. In classifying 28 cancers, the models achieved F1-scores of 92.6% (RF), 92.0% (LAVASET), 89.3% (LAVABOOST) and 85.7% (GBDT), with no method outperforming all others for individual cancer type prediction. The CLIFI metric enables visualisation of the model's decision-making functions. The resulting CLIFI value distributions indicated heterogeneity in the expression of several proteins (MYH11, ER, BCL2) across different cancer types (including brain glioma, breast, kidney, thyroid and prostate cancer), aligning with the original raw expression data. In conclusion, we have developed an integrated, directional feature importance metric for multi-class decision tree-based classification models that facilitates interpretable feature importance assessment. The CLIFI metric can be combined with incorporating topological information into the decision functions of models to introduce inductive bias, enhancing interpretability.
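
The abstract does not give the CLIFI formula, so the following is only a toy sketch of what a class-based, directional importance score could look like for a single decision-tree split. It is not the authors' definition. It assumes Gini impurity as the split criterion, a hypothetical helper named `directional_split_importance`, and made-up expression values for two invented classes "A" and "B". The idea illustrated is that the impurity decrease of a split can be given a per-class sign depending on whether a class mostly falls above or below the threshold.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def directional_split_importance(values, labels, threshold):
    """Toy per-class signed importance for ONE split on ONE feature.

    Hypothetical illustration only, not the published CLIFI metric.
    Returns {class: score}. A positive score means the class mostly lies
    above the threshold, a negative score means mostly at or below it.
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)

    # Weighted Gini impurity decrease achieved by the split.
    gain = (gini(labels)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

    totals = Counter(labels)
    right_counts = Counter(right)
    scores = {}
    for cls, n_cls in totals.items():
        frac_right = right_counts[cls] / n_cls
        direction = 2.0 * frac_right - 1.0  # in [-1, 1]
        scores[cls] = gain * direction
    return scores


# Made-up expression values for one feature and two invented classes.
values = [0.1, 0.2, 0.3, 0.9, 1.1, 1.2]
labels = ["A", "A", "A", "B", "B", "B"]
scores = directional_split_importance(values, labels, threshold=0.5)
print(scores)
```

In this toy data the split separates the classes perfectly, so class "A" receives a negative score (low values of the feature point towards "A") and class "B" a positive score of the same magnitude. A full ensemble version would have to aggregate such scores over all splits and trees, and the abstract does not say how the authors do this.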

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b499/11609472/cfcc86cb80d6/gr001.jpg