A flexible model-free prediction-based framework for feature ranking.

Author information

Li Jingyi Jessica, Chen Yiling Elaine, Tong Xin

Affiliations

Department of Statistics, University of California, Los Angeles.

Department of Data Sciences and Operations, Marshall Business School, University of Southern California.

Publication information

J Mach Learn Res. 2021 May;22.

PMID:35321091

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8939838/

Abstract

Despite the availability of numerous statistical and machine learning tools for joint feature modeling, many scientists investigate features marginally, i.e., one feature at a time. This is partly due to training and convention, but it is also rooted in scientists' strong interest in simple visualization and interpretability. As such, marginal feature ranking for some predictive tasks, e.g., prediction of cancer driver genes, is widely practiced in the process of scientific discovery. In this work, we focus on marginal ranking for binary classification, one of the most common predictive tasks. We argue that the most widely used marginal ranking criteria, including the Pearson correlation, the two-sample t test, and the two-sample Wilcoxon rank-sum test, do not fully take feature distributions and prediction objectives into account. To address this gap in practice, we propose two ranking criteria corresponding to two prediction objectives: the classical criterion (CC) and the Neyman-Pearson criterion (NPC), both of which use model-free nonparametric implementations to accommodate diverse feature distributions. Theoretically, we show that under regularity conditions, both criteria achieve sample-level ranking that is consistent with their population-level counterparts with high probability. Moreover, NPC is robust to sampling bias when the two class proportions in a sample deviate from those in the population. This property gives NPC good potential in biomedical research, where sampling biases are ubiquitous. We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. Our model-free objective-based ranking idea is extendable to ranking feature subsets and generalizable to other prediction tasks and learning objectives.
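The idea of ranking each feature by how well it alone predicts the class label can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a Gaussian kernel density estimate with a normal-reference bandwidth, a single 50/50 sample split, and held-out classification error as the score, in the spirit of the classical criterion. The function names (`gaussian_kde`, `marginal_error`, `rank_features`) and the simulated features are hypothetical. The Neyman-Pearson criterion is not sketched here.

```python
import math
import random
import statistics


def gaussian_kde(sample):
    """Return a 1-D Gaussian kernel density estimate of `sample`.

    The bandwidth uses the normal-reference rule of thumb
    (an assumption of this sketch, not taken from the paper).
    """
    n = len(sample)
    h = 1.06 * statistics.stdev(sample) * n ** (-0.2)
    scale = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return scale * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)

    return density


def marginal_error(x_train, y_train, x_test, y_test):
    """Held-out error of a plug-in classifier built from a single feature."""
    x0 = [x for x, y in zip(x_train, y_train) if y == 0]
    x1 = [x for x, y in zip(x_train, y_train) if y == 1]
    f0, f1 = gaussian_kde(x0), gaussian_kde(x1)
    p1 = len(x1) / len(x_train)  # estimated proportion of class 1
    errors = 0
    for x, y in zip(x_test, y_test):
        # Predict class 1 when its weighted density estimate is larger.
        pred = 1 if p1 * f1(x) > (1.0 - p1) * f0(x) else 0
        errors += int(pred != y)
    return errors / len(x_test)


def rank_features(columns, labels, train_fraction=0.5, seed=0):
    """Rank features by held-out error, lowest (best) first.

    `columns` maps a feature name to a list of values aligned with `labels`.
    """
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_fraction)
    train, test = idx[:cut], idx[cut:]
    scores = {}
    for name, values in columns.items():
        scores[name] = marginal_error(
            [values[i] for i in train], [labels[i] for i in train],
            [values[i] for i in test], [labels[i] for i in test],
        )
    return sorted(scores, key=scores.get), scores


# Simulated data (hypothetical): "bimodal" has equal class means, so a
# mean-based criterion would tend to miss it, yet it is highly predictive.
rng = random.Random(1)
labels = [rng.randint(0, 1) for _ in range(1000)]
columns = {
    "shift": [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels],
    "bimodal": [rng.gauss(rng.choice([-3.0, 3.0]) if y else 0.0, 1.0) for y in labels],
    "noise": [rng.gauss(0.0, 1.0) for _ in labels],
}
ranking, scores = rank_features(columns, labels)
print(ranking)
print(scores)
```

Because the score is an estimated prediction error rather than a difference in means, a feature whose two class distributions differ in shape can still rank highly. A single split is used only for brevity; averaging over several random splits would likely give a more stable ranking.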

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/02b8/8939838/63bc8862f25f/nihms-1737893-f0009.jpg
