Missing value imputation for gene expression data by tailored nearest neighbors.

Authors

Faisal Shahla, Tutz Gerhard

Publication information

Stat Appl Genet Mol Biol. 2017 Apr 25;16(2):95-106. doi: 10.1515/sagmb-2015-0098.

DOI:10.1515/sagmb-2015-0098

PMID:28593876

Abstract

High-dimensional data such as gene expression and RNA-sequencing data often contain missing values. Subsequent analyses and results based on these incomplete data can suffer strongly from the presence of these missing values. Several approaches to imputing missing values in gene expression data have been developed, but the task is difficult because of the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of using nearest neighbors defined by a distance that includes all genes, the distance is computed only for genes that are apt to contribute to the accuracy of the imputed values. The method aims to avoid the curse of dimensionality, which typically occurs when local methods such as nearest neighbors are applied in high-dimensional settings. The proposed weighted nearest neighbors algorithm is compared to existing missing value imputation techniques: mean imputation, KNNimpute, and the recently proposed imputation by random forests. We use RNA-sequencing and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values in high-dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
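The idea of computing the neighbor distance only over genes that are likely to help can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how contributing genes are chosen or how neighbors are weighted. So the correlation threshold `min_corr`, the use of absolute correlation as a distance weight, the inverse-distance neighbor weights, and the function names `gene_correlations` and `impute_weighted_nn` are all assumptions made for illustration.

```python
import numpy as np


def gene_correlations(X, j):
    """Absolute correlation of gene j with every other gene (pairwise-complete rows)."""
    p = X.shape[1]
    r = np.zeros(p)
    for g in range(p):
        both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, g])
        if g == j or both.sum() < 3:
            continue
        a, b = X[both, j], X[both, g]
        if a.std() == 0 or b.std() == 0:
            continue  # correlation undefined for a constant column
        r[g] = abs(np.corrcoef(a, b)[0, 1])
    return r


def impute_weighted_nn(X, k=3, min_corr=0.5):
    """Fill NaNs in X (rows = samples, columns = genes).

    Assumption: a gene contributes to the distance only if its absolute
    correlation with the target gene is at least min_corr.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        donors = np.flatnonzero(~missing)
        if not missing.any() or donors.size == 0:
            continue
        r = gene_correlations(X, j)
        w = np.where(r >= min_corr, r, 0.0)  # zero weight drops a gene
        for i in np.flatnonzero(missing):
            dist = np.full(donors.size, np.inf)
            for idx, d in enumerate(donors):
                shared = ~np.isnan(X[i]) & ~np.isnan(X[d]) & (w > 0)
                if shared.any():
                    diff = X[i, shared] - X[d, shared]
                    dist[idx] = np.sqrt(np.sum(w[shared] * diff**2) / np.sum(w[shared]))
            order = np.argsort(dist)[:k]
            order = order[np.isfinite(dist[order])]
            if order.size == 0:
                # No usable neighbor: fall back to mean imputation.
                filled[i, j] = X[donors, j].mean()
                continue
            # Assumed neighbor weighting: closer samples count more.
            weights = 1.0 / (dist[order] + 1e-8)
            filled[i, j] = np.sum(weights * X[donors[order], j]) / np.sum(weights)
    return filled
```

Calling `impute_weighted_nn(X, k=5)` on a samples-by-genes array with `np.nan` marking missing entries returns a completed copy. In this sketch, genes weakly correlated with the target gene are ignored when choosing neighbors, which is one simple way to keep irrelevant dimensions out of the distance.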
