ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles.

Author information

College of Information and Computer Engineering, Northeast Forestry University, No.26 Hexing Road, Harbin, 150040, China.

Department of Neurology, The 2nd Affiliated Hospital of Harbin Medical University, No. 246 Xuefu Road, Harbin, 150086, China.

Publication information

BMC Bioinformatics. 2020 Feb 5;21(1):43. doi: 10.1186/s12859-020-3388-y.

DOI:10.1186/s12859-020-3388-y

PMID:32024464

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7003361/

Abstract

BACKGROUND

Various methods for differential expression analysis have been widely used to identify the features that best distinguish between different categories of samples. Multiple hypothesis testing may miss explanatory features, each of which may be composed of individually insignificant variables. Multivariate hypothesis testing remains outside the mainstream because large-scale matrix operations carry a heavy computational overhead. Random forest provides a classification strategy for calculating variable importance. However, it may not suit every sample distribution.
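The point about individually insignificant variables can be made concrete with a toy case. The sketch below is an illustrative assumption, not data from the paper: two hypothetical variables, `x1` and `x2`, have identical class means, so a per-variable test has nothing to detect, yet the sign of their product separates the two classes perfectly.

```python
# Toy XOR-style data (hypothetical, for illustration only).
class_0 = [(1.0, 1.0), (-1.0, -1.0)] * 10   # x1 and x2 share a sign
class_1 = [(1.0, -1.0), (-1.0, 1.0)] * 10   # x1 and x2 differ in sign

def column_mean(rows, j):
    """Mean of variable j over a list of samples."""
    return sum(r[j] for r in rows) / len(rows)

# Per-variable view: the gap between class means is zero for both variables.
mean_gaps = [column_mean(class_1, j) - column_mean(class_0, j) for j in range(2)]

# Joint view: classify by the sign of the product x1 * x2.
def joint_rule(row):
    return 0 if row[0] * row[1] > 0 else 1

samples = [(r, 0) for r in class_0] + [(r, 1) for r in class_1]
joint_accuracy = sum(joint_rule(r) == label for r, label in samples) / len(samples)

print(mean_gaps, joint_accuracy)
```

Real expression data will rarely be this clean, but the example shows why screening variables one at a time can discard a feature that is only informative jointly.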

RESULTS

Building on the idea of an ensemble classifier, we develop a feature selection tool for differential expression analysis on expression profiles (ECFS-DEA for short). To account for differences in sample distribution, a graphical user interface lets the user choose among different base classifiers. Inspired by random forest, a common measure applicable to any base classifier is proposed for calculating variable importance. After a feature is selected interactively from the sorted individual variables, a projection heatmap is presented using k-means clustering. A ROC curve is also provided. Both the heatmap and the ROC curve intuitively demonstrate the effectiveness of the selected feature.
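The abstract does not spell out the importance measure, so the following is only a minimal sketch of the general idea, under stated assumptions: each ensemble member is trained on a bootstrap sample and a random subset of variables, and a variable's importance is the average drop in out-of-bag accuracy when its values are shuffled (the permutation scheme familiar from random forest). The nearest-centroid base classifier, the function names, and the parameters `n_rounds` and `n_sub` are all hypothetical choices for illustration, not ECFS-DEA's actual implementation.

```python
import random

def fit_centroids(X, y):
    """Hypothetical base classifier: one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_centroids(model, row):
    """Return the class whose centroid is nearest (squared Euclidean)."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, model[c])))

def accuracy(model, X, y, predict):
    return sum(predict(model, r) == t for r, t in zip(X, y)) / len(y)

def ensemble_importance(X, y, fit=fit_centroids, predict=predict_centroids,
                        n_rounds=200, n_sub=2, seed=0):
    """Average out-of-bag accuracy drop per variable; any fit/predict pair works."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    total, used = [0.0] * p, [0] * p
    for _ in range(n_rounds):
        boot = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]   # out-of-bag samples
        if not oob or len({y[i] for i in boot}) < 2:
            continue
        feats = rng.sample(range(p), n_sub)              # random variable subset
        model = fit([[X[i][j] for j in feats] for i in boot], [y[i] for i in boot])
        X_oob = [[X[i][j] for j in feats] for i in oob]
        y_oob = [y[i] for i in oob]
        base = accuracy(model, X_oob, y_oob, predict)
        for k, j in enumerate(feats):
            col = [r[k] for r in X_oob]
            rng.shuffle(col)                             # break the link to the labels
            X_perm = [r[:k] + [v] + r[k + 1:] for r, v in zip(X_oob, col)]
            total[j] += base - accuracy(model, X_perm, y_oob, predict)
            used[j] += 1
    return [t / u if u else 0.0 for t, u in zip(total, used)]

def make_demo_data(n=60, p=5, seed=1):
    """Synthetic data: only variable 0 differs between the two classes."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(3.0 * label if j == 0 else 0.0, 1.0) for j in range(p)] for label in y]
    return X, y

X, y = make_demo_data()
importance = ensemble_importance(X, y)
ranking = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
print(ranking, [round(v, 3) for v in importance])
```

Because `fit` and `predict` are passed in as arguments, swapping the base classifier does not change how importance is computed, which is the property the abstract attributes to its common measure.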

CONCLUSIONS

Feature selection through ensemble classifiers helps to select important variables and is therefore applicable to different sample distributions. Experiments on simulated and real data demonstrate the effectiveness of ECFS-DEA for differential expression analysis on expression profiles. The software is available at http://bio-nefu.com/resource/ecfs-dea.
