Optimal features selection in the high dimensional data based on robust technique: Application to different health database.

Authors

Hussain Ibrar, Qureshi Moiz, Ismail Muhammad, Iftikhar Hasnain, Zywiołek Justyna, López-Gonzales Javier Linkolk

Affiliations

Department of Statistics, Abdul Wali Khan University Mardan, Pakistan.

Govt Boys Degree College Tandojam, Hyderabad, Sindh, Pakistan.

Publication information

Heliyon. 2024 Sep 2;10(17):e37241. doi: 10.1016/j.heliyon.2024.e37241. eCollection 2024 Sep 15.

DOI:10.1016/j.heliyon.2024.e37241
PMID:39296019
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11408077/
Abstract

Bioinformatics and gene expression analysis face major hurdles when dealing with high-dimensional data, where the number of variables (genes) far exceeds the number of samples. These difficulties are exacerbated, particularly in microarray data processing, by redundant genes that do not contribute significantly to the response variable. To address this issue, gene selection offers a feasible way to identify the most important genes and thereby reduce the generalization error of classification algorithms. This paper introduces a new hybrid approach to gene selection that combines the Signal-to-Noise Ratio (SNR) score with the robust Mood median test. The Mood median test reduces the impact of outliers in non-normal or skewed data while still identifying genes with significant changes across groups. The SNR score measures a gene's importance for classification by comparing the gap between class means with the within-class variability. By integrating the two, the proposed approach aims to find genes that are significant for classification tasks. The major objective of this study is to evaluate the effectiveness of this combined approach in choosing the optimal genes. For each gene, a significant P-value is identified using the Mood median test, and the SNR score is computed. The Md score is then calculated by dividing each gene's SNR value by its significant P-value. Genes with a high SNR are considered favorable because they are minimally influenced by noise and carry high classification importance. To verify the effectiveness of the selected genes, the study uses two dependable classification techniques: Random Forest and K-Nearest Neighbors (KNN). These algorithms were chosen for their track record on classification tasks. The performance of the selected genes is evaluated using two metrics: error reduction and classification accuracy.
These metrics offer an in-depth assessment of how well the selected genes improve classification accuracy and consistency. According to the findings, the proposed hybrid approach outperforms conventional gene selection methods on high-dimensional datasets and yields lower classification error rates. Classification accuracy and error reduction improve considerably when the selected genes are supplied to the Random Forest and KNN classifiers. The outcomes suggest that this hybrid technique can be a helpful tool for improving gene selection in bioinformatics.
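
The scoring step described in the abstract (SNR divided by the Mood median test P-value) can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions the abstract does not state: the SNR is taken in the common two-class form |mean_a - mean_b| / (sd_a + sd_b), the Mood median test is the two-group chi-square version without continuity correction, and the helper names (`snr_score`, `mood_median_test`, `md_score`, `rank_genes`) and the `eps` guard are hypothetical.

```python
import math
from statistics import mean, median, stdev


def snr_score(group_a, group_b):
    """SNR in an assumed form: |mean_a - mean_b| / (sd_a + sd_b).

    Each group needs at least two values (sample standard deviation).
    """
    gap = abs(mean(group_a) - mean(group_b))
    spread = stdev(group_a) + stdev(group_b)
    if spread == 0:
        # No within-class variability: infinite score if the means differ.
        return math.inf if gap > 0 else 0.0
    return gap / spread


def mood_median_test(group_a, group_b):
    """Two-group Mood median test; returns (chi-square statistic, p-value).

    Values are counted as above vs. at-or-below the pooled median, and the
    resulting 2x2 table is tested without continuity correction (assumption).
    """
    grand_median = median(list(group_a) + list(group_b))
    a_above = sum(x > grand_median for x in group_a)
    b_above = sum(x > grand_median for x in group_b)
    a_below = len(group_a) - a_above
    b_below = len(group_b) - b_above
    n = len(group_a) + len(group_b)
    margins = (a_above + b_above) * (a_below + b_below) * len(group_a) * len(group_b)
    if margins == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (a_above * b_below - b_above * a_below) ** 2 / margins
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def md_score(group_a, group_b, eps=1e-300):
    """Md score as described in the abstract: SNR divided by the P-value.

    eps is a hypothetical guard against division by zero.
    """
    _, p_value = mood_median_test(group_a, group_b)
    return snr_score(group_a, group_b) / max(p_value, eps)


def rank_genes(expression, labels, top_k):
    """Rank genes by Md score, highest first.

    expression: dict mapping gene name -> list of values, one per sample.
    labels: list of 0/1 class labels aligned with the samples.
    """
    scores = {}
    for gene, values in expression.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = md_score(class0, class1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Calling `rank_genes(expression, labels, top_k)` on a small expression table returns the names of the highest-scoring genes, which could then be passed to any classifier. Dividing by the P-value means a gene must both separate the class means and show a median shift to rank highly; a gene with a P-value near 1 keeps roughly its raw SNR.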

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53c3/11408077/22b24f83ea13/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53c3/11408077/2a2aeb36d9ab/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53c3/11408077/e24550652fe2/gr2.jpg
