STatistical Inference Relief (STIR) feature selection.

Author information

Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.

Department of Mathematics, University of Tulsa, Tulsa, OK, USA.

Publication information

Bioinformatics. 2019 Apr 15;35(8):1358-1365. doi: 10.1093/bioinformatics/bty788.

DOI:10.1093/bioinformatics/bty788

PMID:30239600

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6477983/

Abstract

MOTIVATION

Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features. We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.
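The pseudo t-test idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stir_scores`, the fixed-k neighborhood, the range-scaled Manhattan distance, and the pooled-variance form of the statistic are all assumptions made for illustration, and the exact estimator is defined in the paper itself. For each attribute, the sketch collects the attribute differences between each sample and its nearest misses (opposite class) and nearest hits (same class), then compares the two means using a standard error built from the sample variances of those differences. Because neighbor pairs are not independent, the resulting statistic is only approximately t-distributed.

```python
import math
import random
from statistics import mean, variance


def stir_scores(X, y, k=3):
    """Illustrative pseudo t-statistic per attribute (hypothetical helper).

    X: list of samples (lists of floats); y: list of 0/1 class labels.
    Returns one (t_statistic, degrees_of_freedom) tuple per attribute.
    """
    n, p = len(X), len(X[0])
    # Range of each attribute, used to scale diffs (1.0 if constant).
    ranges = [max(r[a] for r in X) - min(r[a] for r in X) or 1.0
              for a in range(p)]

    def dist(i, j):
        # Manhattan distance over range-scaled attributes (an assumption).
        return sum(abs(X[i][a] - X[j][a]) / ranges[a] for a in range(p))

    hits = [[] for _ in range(p)]    # diffs for same-class neighbor pairs
    misses = [[] for _ in range(p)]  # diffs for opposite-class neighbor pairs
    for i in range(n):
        for same, bucket in ((True, hits), (False, misses)):
            cand = [j for j in range(n)
                    if j != i and (y[j] == y[i]) == same]
            # Keep the k nearest candidates of this type.
            for j in sorted(cand, key=lambda j: dist(i, j))[:k]:
                for a in range(p):
                    bucket[a].append(abs(X[i][a] - X[j][a]) / ranges[a])

    out = []
    for a in range(p):
        nm, nh = len(misses[a]), len(hits[a])
        # Pooled standard deviation of the miss and hit diffs.
        sp = math.sqrt(((nm - 1) * variance(misses[a])
                        + (nh - 1) * variance(hits[a])) / (nm + nh - 2))
        se = sp * math.sqrt(1 / nm + 1 / nh)
        t = (mean(misses[a]) - mean(hits[a])) / se if se > 0 else 0.0
        out.append((t, nm + nh - 2))
    return out


# Toy case-control data: attribute 0 carries a main effect, the rest are noise.
random.seed(0)
y = [i % 2 for i in range(40)]
X = [[random.gauss(3.0 * c, 1.0)] + [random.gauss(0.0, 1.0) for _ in range(4)]
     for c in y]
scores = stir_scores(X, y, k=3)
for a, (t, df) in enumerate(scores):
    print(f"attribute {a}: t = {t:.2f}, df = {df}")
```

A one-sided p-value for each attribute could then be read from a t distribution with the reported degrees of freedom and adjusted for multiple testing, which is what removes the need for an arbitrary score threshold.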

RESULTS

We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.

AVAILABILITY AND IMPLEMENTATION

Code and data available at http://insilico.utulsa.edu/software/STIR.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
