Comparison of five supervised feature selection algorithms leading to top features and gene signatures from multi-omics data in cancer.

Affiliations

Department of Computer Science and Engineering, Aliah University, Kolkata, West Bengal, 700160, India.

Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.

Publication information

BMC Bioinformatics. 2022 Apr 28;23(Suppl 3):153. doi: 10.1186/s12859-022-04678-y.

DOI:10.1186/s12859-022-04678-y
PMID:35484501
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9052461/
Abstract

BACKGROUND

Because a large volume of complex omics data has been generated over the last two decades, dimensionality reduction has become a challenging issue in mining such data. Omics data typically consist of many features, and many feature selection algorithms have accordingly been developed. The performance of these feature selection methods often varies with the specific data, which makes the discovery and interpretation of results challenging.

METHODS AND RESULTS

In this study, we performed a comprehensive comparative study of five widely used supervised feature selection methods (mRMR, INMIFS, DFS, SVM-RFE-CBR and VWMRmR) for multi-omics datasets. Specifically, we used five representative datasets: gene expression (Exp), exon expression (ExpExon), DNA methylation (hMethyl27), copy number variation (Gistic2), and pathway activity (Paradigm IPLs), all from a multi-omics study of acute myeloid leukemia (LAML) from The Cancer Genome Atlas (TCGA). The feature subsets selected by the five feature selection algorithms were assessed using three evaluation criteria: (1) classification accuracy (Acc), (2) representation entropy (RE) and (3) redundancy rate (RR). Four classifiers, viz., C4.5, NaiveBayes, KNN, and AdaBoost, were used to measure the classification accuracy (Acc) of each selected feature subset. The VWMRmR algorithm obtains the best Acc for three datasets (ExpExon, hMethyl27 and Paradigm IPLs). It offers the best RR (obtained using normalized mutual information) for three datasets (Exp, Gistic2 and Paradigm IPLs), and the best RR (obtained using the Pearson correlation coefficient) for two datasets (Gistic2 and Paradigm IPLs). It also obtains the best RE for three datasets (Exp, Gistic2 and Paradigm IPLs). Overall, the VWMRmR algorithm yields the best performance on all three evaluation criteria for the majority of the datasets. In addition, we identified signature genes using supervised learning, collected from the overlapping top feature set among the five feature selection methods. We obtained a 7-gene signature (ZMIZ1, ENG, FGFR1, PAWR, KRT17, MPO and LAT2) for Exp, a 9-gene signature for ExpExon, a 7-gene signature for hMethyl27, a single-gene signature (PIK3CG) for Gistic2 and a 3-gene signature for Paradigm IPLs.
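The abstract names representation entropy (RE) and redundancy rate (RR) as evaluation criteria but does not give their formulas. The sketch below uses definitions that are common in the feature selection literature and are assumptions here, not necessarily the paper's exact ones: RR as the mean absolute Pearson correlation over distinct pairs of selected features, and RE as the entropy of the normalized eigenvalues of the covariance matrix of the selected features. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def redundancy_rate(X):
    """Mean absolute Pearson correlation over distinct feature pairs.

    X has shape (n_samples, n_features) and holds only the selected
    features (at least two). Assumed definition; the source abstract
    does not state the formula.
    """
    corr = np.corrcoef(X, rowvar=False)
    upper = np.triu_indices(corr.shape[0], k=1)  # each pair counted once
    return float(np.mean(np.abs(corr[upper])))


def representation_entropy(X):
    """Entropy of the normalized eigenvalues of the feature covariance.

    Assumed definition; the source abstract does not state the formula.
    """
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # drop round-off negatives
    p = eig / eig.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))


# Toy subset with two uncorrelated, equal-variance features.
X_toy = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
rr = redundancy_rate(X_toy)
re = representation_entropy(X_toy)
```

Under these assumed definitions, a lower RR indicates less pairwise redundancy among the selected features, and a higher RE indicates that variance is spread more evenly across them. The normalized-mutual-information variant of RR mentioned in the abstract would replace the correlation matrix with pairwise normalized mutual information.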

CONCLUSION

We comprehensively compared the performance of five well-known feature selection methods for mining features from various high-dimensional datasets. We identified signature genes using supervised learning for each specific omics data type for the disease. The study will help incorporate higher-order dependencies among features.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/97ca/9052461/57218b9878d7/12859_2022_4678_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/97ca/9052461/6f9cbb7e0e11/12859_2022_4678_Fig1_HTML.jpg
