STLBRF: an improved random forest algorithm based on standardized-threshold for feature screening of gene expression data.

Authors

Feng Huini, Ju Ying, Yin Xiaofeng, Qiu Wenshi, Zhang Xu

Affiliations

School of Mathematics and Statistics, Southwest University, Chongqing, China.

School of Informatics, Xiamen University, Xiamen, China.

Publication information

Brief Funct Genomics. 2025 Jan 15;24. doi: 10.1093/bfgp/elae048.

DOI: 10.1093/bfgp/elae048

PMID: 39736135

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11735748/

Abstract

When the traditional random forest (RF) algorithm is used to select features in biostatistical data, large amounts of noisy data and numerous parameters can distort the importance of the selected features, making feature selection difficult to control. Preserving the accuracy of the results in the presence of noisy data is therefore a challenge for the traditional RF algorithm, and directly removing the noisy data can introduce significant bias. In this study, we develop a new algorithm, standardized threshold and loops based random forest (STLBRF), and apply it to feature gene selection in gene expression data. Building on the traditional RF algorithm, STLBRF combines backward elimination with K-fold cross-validation to construct a cyclic system governed by a standardized threshold, the error increment. The algorithm overcomes shortcomings of existing gene selection methods. We compare ridge regression, lasso regression, elastic net regression, the traditional RF algorithm, and our improved RF algorithm in a quantitative analysis on three real gene expression datasets. To ensure the reliability of the results, we validate the genes selected by each method using a random forest classifier. The results indicate that, compared with the other methods, STLBRF selects feature genes more effectively and gives better control over the number of selected genes. Our method offers reliable technical support for feature expression analysis and biomarker selection research.
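The abstract names the ingredients of the loop (RF importance, backward elimination, K-fold cross-validation, and an error-increment threshold) but not their exact definitions. The following is a minimal sketch of how such a loop could be arranged, not the authors' implementation. The function names, the `drop_fraction` parameter, and the reading of "error increment" as the absolute rise in cross-validated error over the best error seen so far are all assumptions made for illustration. The hypothetical helper `make_rf_scorers` shows one way to supply the two callables with scikit-learn; it imports scikit-learn lazily, so the core loop itself has no dependencies.

```python
from typing import Callable, Dict, List, Sequence


def backward_eliminate(
    features: Sequence[str],
    cv_error: Callable[[List[str]], float],
    importance: Callable[[List[str]], Dict[str, float]],
    max_error_increment: float = 0.01,  # assumed form of the threshold
    drop_fraction: float = 0.2,         # hypothetical step-size parameter
    min_features: int = 1,
) -> List[str]:
    """Drop the least important features in a loop until the cross-validated
    error rises by more than max_error_increment over the best error seen."""
    current = list(features)
    best_error = cv_error(current)
    while len(current) > min_features:
        scores = importance(current)
        # Drop at least one feature, but never go below min_features.
        n_drop = max(1, int(len(current) * drop_fraction))
        n_drop = min(n_drop, len(current) - min_features)
        ranked = sorted(current, key=lambda f: scores[f])  # least important first
        to_drop = set(ranked[:n_drop])
        candidate = [f for f in current if f not in to_drop]
        error = cv_error(candidate)
        if error - best_error > max_error_increment:
            break  # error increment exceeds the threshold: keep the current set
        current = candidate
        best_error = min(best_error, error)
    return current


def make_rf_scorers(X, y, names, k=5, seed=0):
    """Hypothetical helper: build cv_error / importance callables with
    scikit-learn. X is a NumPy array of shape (samples, genes)."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    column = {name: i for i, name in enumerate(names)}

    def cv_error(subset):
        # K-fold cross-validated misclassification rate on the chosen genes.
        rf = RandomForestClassifier(n_estimators=200, random_state=seed)
        idx = [column[f] for f in subset]
        return 1.0 - cross_val_score(rf, X[:, idx], y, cv=k).mean()

    def importance(subset):
        # Impurity-based RF importances for the chosen genes.
        rf = RandomForestClassifier(n_estimators=200, random_state=seed)
        idx = [column[f] for f in subset]
        rf.fit(X[:, idx], y)
        return dict(zip(subset, rf.feature_importances_))

    return cv_error, importance
```

With an expression matrix `X`, labels `y`, and gene names `names`, the pieces would be combined as `backward_eliminate(names, *make_rf_scorers(X, y, names))`. Passing the error and importance functions as callables keeps the stopping rule separate from the model, so a different threshold definition can be swapped in without touching the RF code.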

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1440/11735748/f6c913d19757/elae048f1.jpg
