Robust data imputation.

Author information

Karlien Vanden Branden, Sabine Verboven

Affiliation

Joint Research Centre, TP 361, 21020 Ispra VA, Italy.

Publication information

Comput Biol Chem. 2009 Feb;33(1):7-13. doi: 10.1016/j.compbiolchem.2008.07.019. Epub 2008 Jul 18.

DOI:10.1016/j.compbiolchem.2008.07.019

PMID:18771957

Abstract

Single imputation methods have been a widely discussed topic among researchers in the field of bioinformatics. One major shortcoming of the methods proposed so far is the lack of robustness considerations. Like all data, gene expression data can contain outlying values. The presence of these outliers can distort the values imputed for the missing entries. Consequently, any statistical analysis of the completed data could lead to incorrect conclusions. It is therefore important to consider the possibility of outliers in the data set and to evaluate how imputation techniques handle these values. In this paper, a simulation study is performed to test existing data imputation techniques when outlying values are present in the data. To overcome some shortcomings of the existing imputation techniques, a new robust imputation method that can deal with the presence of outliers in the data is introduced. In addition, the robust imputation procedure cleans the data for further statistical analysis. Moreover, this method can easily be extended to a multiple imputation approach, which emphasises the uncertainty of the imputed values. Finally, a classification example illustrates the lack of robustness of some existing imputation methods and shows the advantage of the multiple imputation approach of the new robust imputation technique.
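
The abstract's central concern, that a single outlier can distort imputed values, can be illustrated with a minimal sketch. This is not the robust method introduced in the paper; it only contrasts mean imputation with median imputation (a simple, more outlier-resistant stand-in) on made-up numbers. The `impute` helper and the `gene` values are hypothetical and chosen purely for illustration.

```python
import statistics

def impute(values, center):
    """Fill None entries with center(observed values). Hypothetical helper."""
    observed = [v for v in values if v is not None]
    fill = center(observed)
    return [fill if v is None else v for v in values]

# Made-up expression values for one gene; 250.0 is a planted outlier.
gene = [2.1, 1.9, None, 2.0, 2.2, 250.0, None, 1.8]

# Mean imputation: the outlier pulls the fill value far from the bulk of the data.
mean_filled = impute(gene, statistics.mean)

# Median imputation: the fill value stays close to the typical observations.
median_filled = impute(gene, statistics.median)

print("mean-imputed value:  ", mean_filled[2])
print("median-imputed value:", median_filled[2])
```

In this toy case the mean-based fill lands above 40 while the median-based fill stays near 2, so any downstream analysis of the mean-imputed data would inherit the outlier's influence at every formerly missing position.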
