Optimizing Model Performance and Interpretability: Application to Biological Data Classification.

Authors

Huang Zhenyu, Mu Xuechen, Cao Yangkun, Chen Qiufen, Qiao Siyu, Shi Bocheng, Xiao Gangyi, Wang Yan, Xu Ying

Affiliations

College of Computer Science and Technology, Jilin University, Changchun 130012, China.

Systems Biology Lab for Metabolic Reprogramming, Department of Human Genetics and Cell Biology, School of Medicine, Southern University of Science and Technology, Shenzhen 518055, China.

Publication

Genes (Basel). 2025 Feb 28;16(3):297. doi: 10.3390/genes16030297.

DOI:10.3390/genes16030297
PMID:40149449
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11942234/

Abstract

This study introduces a framework that addresses both classification accuracy and result interpretability in transcriptomic-data-based classification.

Background/Objectives: In biological data classification, it is challenging to achieve high accuracy and interpretability at the same time. The goal is to select features, models, and a meta-voting classifier that together optimize both classification performance and interpretability.

Methods: The framework consists of a four-step feature selection process: (1) the identification of metabolic pathways whose enzyme-gene expressions discriminate samples with different labels, aiding interpretability; (2) the selection of pathways whose expression variance is largely captured by the first principal component of the gene expression matrix; (3) the selection of minimal sets of genes whose collective discerning power covers 95% of the pathway-based discerning power; and (4) the introduction of adversarial samples to identify and filter out genes sensitive to such samples. Additionally, adversarial samples are used to select the optimal classification model, and a meta-voting classifier is constructed from the optimized model results.

Results: The framework was applied to two cancer classification problems. In the binary classification, the prediction performance was comparable to that of the full-gene model, with F1-score differences between -5% and 5%. In the ternary classification, the performance was significantly better, with F1-score differences ranging from -2% to 12%, while the selected feature genes remained highly interpretable.

Conclusions: This framework integrates feature selection, adversarial sample handling, and model optimization, offering a useful tool for a wide range of biological data classification problems. Its ability to balance accuracy with interpretability makes it well suited to computational biology.
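Step (2) of the feature selection described above keeps pathways whose expression variance is largely captured by the first principal component. The abstract does not give the cutoff or the implementation, so the following is only a minimal sketch under stated assumptions: the function names, the 0.8 threshold, and the toy pathway data are hypothetical, and power iteration is used in place of a full PCA to keep the example dependency-free.

```python
import math
import random


def pc1_variance_fraction(samples, iterations=500, seed=0):
    """Fraction of total variance captured by the first principal component.

    `samples` is a list of rows (one per sample); each row holds the
    expression values of the genes in one pathway.
    """
    n_samples = len(samples)
    n_genes = len(samples[0])
    # Center each gene (column) at zero mean.
    means = [sum(row[j] for row in samples) / n_samples for j in range(n_genes)]
    centered = [[row[j] - means[j] for j in range(n_genes)] for row in samples]
    # Sample covariance matrix (genes x genes).
    cov = [[sum(r[a] * r[b] for r in centered) / (n_samples - 1)
            for b in range(n_genes)] for a in range(n_genes)]
    total_variance = sum(cov[j][j] for j in range(n_genes))
    if total_variance == 0:
        return 0.0
    # Power iteration: repeated multiplication by cov drives the vector
    # toward the dominant eigenvector (the PC1 direction). A fixed-seed
    # random start makes an exactly orthogonal start very unlikely.
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in range(n_genes)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(n_genes))
               for a in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient estimates the largest eigenvalue (variance along PC1).
    cov_vec = [sum(cov[a][b] * vec[b] for b in range(n_genes))
               for a in range(n_genes)]
    top_eigenvalue = (sum(v * cv for v, cv in zip(vec, cov_vec))
                      / sum(v * v for v in vec))
    return top_eigenvalue / total_variance


def select_pathways(pathways, threshold=0.8):
    """Keep pathways whose PC1 fraction meets the (assumed) threshold."""
    return [name for name, samples in pathways.items()
            if pc1_variance_fraction(samples) >= threshold]


# Hypothetical toy data: two genes per pathway, a few samples each.
toy_pathways = {
    "coherent": [[1, 2], [2, 4], [3, 6]],             # genes move together
    "diffuse": [[1, 0], [-1, 0], [0, 1], [0, -1]],    # genes vary independently
}
selected = select_pathways(toy_pathways)
```

In this toy case only the pathway whose genes vary together passes the cutoff. With real expression matrices, a library PCA routine would be the more practical choice, and the threshold would need to be tuned to the data.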

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/11942234/e1bb94c59f24/genes-16-00297-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/11942234/d2894f392b0e/genes-16-00297-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/11942234/6b2563f7cb79/genes-16-00297-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/11942234/f00ccc90bc45/genes-16-00297-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/11942234/fb869e8ffca5/genes-16-00297-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/11942234/c2014c072afb/genes-16-00297-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/08bf/11942234/90329e67d46b/genes-16-00297-g007.jpg
