Optimizing Model Performance and Interpretability: Application to Biological Data Classification

Authors

Huang Zhenyu, Mu Xuechen, Cao Yangkun, Chen Qiufen, Qiao Siyu, Shi Bocheng, Xiao Gangyi, Wang Yan, Xu Ying

Affiliations

College of Computer Science and Technology, Jilin University, Changchun 130012, China.

Systems Biology Lab for Metabolic Reprogramming, Department of Human Genetics and Cell Biology, School of Medicine, Southern University of Science and Technology, Shenzhen 518055, China.

Publication information

Genes (Basel). 2025 Feb 28;16(3):297. doi: 10.3390/genes16030297.

DOI:10.3390/genes16030297

PMID:40149449

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11942234/

Abstract

This study introduces a novel framework that simultaneously addresses the challenges of classification accuracy and result interpretability in transcriptomic-data-based classification.

Background/Objectives: In biological data classification, it is challenging to achieve high classification accuracy and interpretability at the same time. This study presents a framework that addresses both challenges in transcriptomic-data-based classification. The goal is to select features, models, and a meta-voting classifier that together optimize classification performance and interpretability.

Methods: The framework consists of a four-step feature selection process: (1) the identification of metabolic pathways whose enzyme-gene expressions discriminate samples with different labels, aiding interpretability; (2) the selection of pathways whose expression variance is largely captured by the first principal component of the gene expression matrix; (3) the selection of minimal sets of genes whose collective discerning power covers 95% of the pathway-based discerning power; and (4) the introduction of adversarial samples to identify and filter out genes sensitive to such samples. Additionally, adversarial samples are used to select the optimal classification model, and a meta-voting classifier is constructed based on the optimized model results.

Results: Applied to two cancer classification problems, the framework achieved prediction performance comparable to the full-gene model in the binary classification, with F1-score differences between -5% and 5%. In the ternary classification, performance was significantly better, with F1-score differences ranging from -2% to 12%, while maintaining excellent interpretability of the selected feature genes.

Conclusions: This framework effectively integrates feature selection, adversarial sample handling, and model optimization, offering a valuable tool for a wide range of biological data classification problems. Its ability to balance classification accuracy with high interpretability makes it highly applicable in the field of computational biology.
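
Step (2) of the feature selection process can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how "largely captured" is quantified, so the 0.5 cutoff, the function names `pc1_variance_ratio` and `select_pathways`, and the synthetic data below are all assumptions made for illustration. The sketch computes, for each pathway, the fraction of total expression variance explained by the first principal component and keeps pathways above the cutoff.

```python
import numpy as np


def pc1_variance_ratio(expr):
    """Fraction of total variance captured by the first principal component.

    expr: 2-D array, rows = samples, columns = genes of one pathway.
    """
    centered = expr - expr.mean(axis=0)  # center each gene (column)
    # Squared singular values of the centered matrix are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    total = variances.sum()
    return float(variances[0] / total) if total > 0 else 0.0


def select_pathways(pathway_expr, threshold=0.5):
    """Keep pathways whose PC1 variance ratio is at least `threshold`.

    The default threshold is a hypothetical value, not taken from the paper.
    """
    return [
        name
        for name, expr in pathway_expr.items()
        if pc1_variance_ratio(expr) >= threshold
    ]


# Synthetic toy data (illustration only, not real expression values).
rng = np.random.default_rng(0)
driver = rng.normal(size=(40, 1))
# Five genes that all follow one shared signal plus small noise.
coherent = driver @ np.ones((1, 5)) + 0.1 * rng.normal(size=(40, 5))
# Five genes that vary independently of one another.
noisy = rng.normal(size=(40, 5))

pathways = {"coherent_pathway": coherent, "noisy_pathway": noisy}
selected = select_pathways(pathways, threshold=0.5)
print(selected)
```

In this toy setup the pathway whose genes share a common signal passes the filter, while the pathway of independently varying genes does not. A real analysis would choose the cutoff from the data or from the criteria given in the full paper.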
