
Nonparametric IPSS: fast, flexible feature selection with false discovery control.

Authors

Melikechi Omar, Dunson David B, Miller Jeffrey W

Affiliations

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, United States.

Department of Statistical Science, Duke University, Durham, NC, 27708, United States.

Publication

Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf299.

DOI: 10.1093/bioinformatics/btaf299
PMID: 40358526
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12119134/
Abstract

MOTIVATION

Feature selection is a critical task in machine learning and statistics. However, existing feature selection methods either (i) rely on parametric methods such as linear or generalized linear models, (ii) lack theoretical false discovery control, or (iii) identify few true positives.

RESULTS

We introduce a general feature selection method with finite-sample false discovery control based on applying integrated path stability selection (IPSS) to arbitrary feature importance scores. The method is nonparametric whenever the importance scores are nonparametric, and it estimates q-values, which are better suited to high-dimensional data than P-values. We focus on two special cases using importance scores from gradient boosting (IPSSGB) and random forests (IPSSRF). Extensive nonlinear simulations with RNA sequencing data show that both methods accurately control the false discovery rate and detect more true positives than existing methods. Both methods are also efficient, running in under 20 s when there are 500 samples and 5000 features. We apply IPSSGB and IPSSRF to detect microRNAs and genes related to cancer, finding that they yield better predictions with fewer features than existing approaches.
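
The idea of building a selector on top of an arbitrary importance score can be illustrated with a small sketch. The code below is not IPSS and does not use the authors' `ipss` package: it is plain subsampling-based stability selection, it estimates no q-values, and it offers no false discovery guarantee. The function names, the rank-correlation importance score, the `top_k` cutoff, the 0.9 frequency threshold, and the toy data are all assumptions made for illustration; the paper instead uses importance scores from gradient boosting and random forests.

```python
import random

def abs_rank_correlation(x, y):
    """Absolute Spearman-type rank correlation (assumes no ties).

    Stands in for "an arbitrary nonparametric importance score".
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return abs(cov) / ((vx * vy) ** 0.5) if vx > 0 and vy > 0 else 0.0

def selection_frequencies(X, y, score, top_k, n_subsamples=50, seed=0):
    """Fraction of half-size subsamples in which each feature ranks in the top_k."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # random half of the rows
        ys = [y[i] for i in idx]
        scores = [score([X[i][j] for i in idx], ys) for j in range(p)]
        top = sorted(range(p), key=lambda j: scores[j], reverse=True)[:top_k]
        for j in top:
            counts[j] += 1
    return [c / n_subsamples for c in counts]

# Toy data: 200 samples, 20 features; only features 0 and 1 affect y,
# and feature 0 enters nonlinearly.
data_rng = random.Random(1)
n, p = 200, 20
X = [[data_rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [row[0] ** 3 + row[1] + data_rng.gauss(0, 0.5) for row in X]

freqs = selection_frequencies(X, y, abs_rank_correlation, top_k=3)
selected = [j for j, f in enumerate(freqs) if f >= 0.9]  # illustrative threshold
print(selected)
```

Features that are selected in nearly every subsample are treated as stable. Swapping `abs_rank_correlation` for a tree-based importance score would move the sketch closer to the setting described in the abstract, but the error-control machinery of IPSS is outside its scope.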

AVAILABILITY AND IMPLEMENTATION

All code and data used in this work are available on GitHub (https://github.com/omelikechi/ipss_bioinformatics) and permanently archived on Zenodo (https://doi.org/10.5281/zenodo.15335289). A Python package for implementing IPSS is available on GitHub (https://github.com/omelikechi/ipss) and PyPI (https://pypi.org/project/ipss/). An R implementation of IPSS is also available on GitHub (https://github.com/omelikechi/ipssR).


Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8042/12119134/7f3378fb4056/btaf299f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8042/12119134/8e050148ee78/btaf299f2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8042/12119134/2763aeaf92da/btaf299f3.jpg

Similar articles

1. Nonparametric IPSS: fast, flexible feature selection with false discovery control.
Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf299.
2. JAX-RNAfold: scalable differentiable folding.
Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf203.
3. pyaging: a Python-based compendium of GPU-optimized aging clocks.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae200.
4. A machine-learning-based alternative to phylogenetic bootstrap.
Bioinformatics. 2024 Jun 28;40(Suppl 1):i208-i217. doi: 10.1093/bioinformatics/btae255.
5. PyGenePlexus: a Python package for gene discovery using network-based machine learning.
Bioinformatics. 2023 Feb 3;39(2). doi: 10.1093/bioinformatics/btad064.
6. A machine learning-based method for automatically identifying novel cells in annotating single-cell RNA-seq data.
Bioinformatics. 2022 Oct 31;38(21):4885-4892. doi: 10.1093/bioinformatics/btac617.
7. PanClassif: Improving pan cancer classification of single cell RNA-seq gene expression data using machine learning.
Genomics. 2022 Mar;114(2):110264. doi: 10.1016/j.ygeno.2022.01.001. Epub 2022 Jan 6.
8. dRFEtools: dynamic recursive feature elimination for omics.
Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad513.
9. SimSeq: a nonparametric approach to simulation of RNA-sequence datasets.
Bioinformatics. 2015 Jul 1;31(13):2131-40. doi: 10.1093/bioinformatics/btv124. Epub 2015 Feb 26.
10. scFates: a scalable python package for advanced pseudotime and bifurcation analysis from single-cell data.
Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac746.
