Improving the performance and interpretability on medical datasets using graphical ensemble feature selection.

Affiliations

Network Science Institute, Northeastern University, Boston, MA 02115, United States.

Scipher Medicine, Waltham, MA 02453, United States.

Publication information

Bioinformatics. 2024 Jun 3;40(6). doi: 10.1093/bioinformatics/btae341.

DOI:10.1093/bioinformatics/btae341

PMID:38837347

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11187494/

Abstract

MOTIVATION

A major hindrance towards using Machine Learning (ML) on medical datasets is the discrepancy between a large number of variables and small sample sizes. While multiple feature selection techniques have been proposed to avoid the resulting overfitting, overall ensemble techniques offer the best selection robustness. Yet, current methods designed to combine different algorithms generally fail to leverage the dependencies identified by their components. Here, we propose Graphical Ensembling (GE), a graph-theory-based ensemble feature selection technique designed to improve the stability and relevance of the selected features.
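The abstract does not describe how GE builds or uses its graph, so the following is only a generic sketch of the broader idea of combining several selectors through a graph rather than a plain vote. It is not the authors' algorithm. The function names, the co-selection weighting, and the toy feature names (`f1` to `f4`) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def co_selection_graph(selections):
    """Build a weighted graph over features (illustrative assumption).

    Each edge (a, b) is weighted by the number of selectors that
    picked both feature a and feature b.
    """
    edges = Counter()
    for chosen in selections:
        for a, b in combinations(sorted(set(chosen)), 2):
            edges[(a, b)] += 1
    return edges


def rank_by_weighted_degree(edges):
    """Rank features by the total weight of their incident edges."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    # Highest weighted degree first; ties broken alphabetically.
    return sorted(degree, key=lambda f: (-degree[f], f))


# Hypothetical outputs of three base feature selectors.
selections = [
    {"f1", "f2", "f3"},
    {"f1", "f2"},
    {"f2", "f4"},
]

edges = co_selection_graph(selections)
ranking = rank_by_weighted_degree(edges)
print(ranking)
```

A plain majority vote would only count how often each feature is picked; the graph view additionally records which features tend to be picked together, which is the kind of dependency information the abstract says current ensembles fail to use.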

RESULTS

Relying on four datasets, we show that GE increases classification performance with fewer selected features. For example, on rheumatoid arthritis patient stratification, GE outperforms the baseline methods by 9% Balanced Accuracy while relying on fewer features. We use data on sub-cellular networks to show that the selected features (proteins) are closer to the known disease genes, and the uncovered biological mechanisms are more diversified. By successfully tackling the complex correlations between biological variables, we anticipate that GE will improve the medical applications of ML.
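For readers unfamiliar with the metric quoted above, balanced accuracy is commonly defined as the mean of the per-class recalls, which keeps a majority class from dominating the score. The sketch below uses made-up labels, not data from the paper, and the helper name `balanced_accuracy` is chosen here for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example: 8 positives and 2 negatives.
y_true = [1] * 8 + [0] * 2
# All positives are predicted correctly; one negative is missed.
y_pred = [1] * 8 + [1, 0]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

In this toy case plain accuracy is 9/10 while the per-class recalls are 1.0 and 0.5, so the balanced score is lower. That gap is why the metric suits small, imbalanced medical cohorts.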

AVAILABILITY AND IMPLEMENTATION

https://github.com/ebattistella/auto_machine_learning.


