Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
Department of Mathematics, University of Tulsa, Tulsa, OK, USA.
Bioinformatics. 2019 Apr 15;35(8):1358-1365. doi: 10.1093/bioinformatics/bty788.
Relief is a family of machine learning algorithms that use nearest neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense: they have no parameterized model with an underlying probability distribution for the estimator, which makes it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds when selecting the most important features. We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating the sample variance of the nearest-neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.
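The idea of a pseudo t-test on nearest-neighbor distances can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a fixed-k neighbor search with Manhattan distance, range-normalized per-attribute differences, a pooled-variance two-sample t statistic comparing miss (different-class) and hit (same-class) neighbor pairs, a normal approximation for the one-sided p-value, and a Bonferroni adjustment. The function names `knn_pairs` and `stir_t` and the toy data are hypothetical.

```python
import math
import random


def knn_pairs(X, y, k):
    """For each sample, pair it with its k nearest hits (same class)
    and k nearest misses (other class) under Manhattan distance."""
    hits, misses = [], []
    n = len(X)
    for i in range(n):
        dists = sorted(
            (sum(abs(a - b) for a, b in zip(X[i], X[j])), j)
            for j in range(n) if j != i
        )
        hits.extend((i, j) for _, j in [d for d in dists if y[d[1]] == y[i]][:k])
        misses.extend((i, j) for _, j in [d for d in dists if y[d[1]] != y[i]][:k])
    return hits, misses


def mean_var(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)


def stir_t(X, y, k=3):
    """Return (t, p) per attribute: a pooled-variance t statistic of
    miss differences versus hit differences (illustrative assumption)."""
    hits, misses = knn_pairs(X, y, k)
    results = []
    for a in range(len(X[0])):
        column = [row[a] for row in X]
        value_range = (max(column) - min(column)) or 1.0
        d_hit = [abs(X[i][a] - X[j][a]) / value_range for i, j in hits]
        d_miss = [abs(X[i][a] - X[j][a]) / value_range for i, j in misses]
        m_hit, v_hit = mean_var(d_hit)
        m_miss, v_miss = mean_var(d_miss)
        n_hit, n_miss = len(d_hit), len(d_miss)
        pooled = ((n_hit - 1) * v_hit + (n_miss - 1) * v_miss) / (n_hit + n_miss - 2)
        se = math.sqrt(pooled * (1.0 / n_miss + 1.0 / n_hit))
        t = (m_miss - m_hit) / se if se > 0 else 0.0
        # One-sided p-value; normal approximation to the t distribution,
        # reasonable only when there are many neighbor pairs.
        p = 0.5 * math.erfc(t / math.sqrt(2.0))
        results.append((t, p))
    return results


# Toy case-control data (hypothetical): attribute 0 carries a main effect,
# attributes 1-4 are pure noise.
rng = random.Random(0)
y = [0] * 20 + [1] * 20
X = [[3.0 * label + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(4)]
     for label in y]

scores = stir_t(X, y, k=3)
# Bonferroni adjustment across attributes (one of several possible choices).
adjusted = [min(1.0, p * len(scores)) for _, p in scores]
```

Here `scores[a]` holds the statistic and unadjusted p-value for attribute `a`, and `adjusted[a]` the Bonferroni-adjusted p-value. Neighbor pairs are not independent observations, so the p-values in this sketch are approximate at best.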
We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.
Code and data are available at http://insilico.utulsa.edu/software/STIR.
Supplementary data are available at Bioinformatics online.