Dealing with missing values in large-scale studies: microarray data imputation and beyond.

Affiliation

Biomathematics Research Group, Department of Mathematics, FI-20014 University of Turku, Finland.

Publication information

Brief Bioinform. 2010 Mar;11(2):253-64. doi: 10.1093/bib/bbp059. Epub 2009 Dec 4.

Abstract

High-throughput biotechnologies, such as gene expression microarrays or mass-spectrometry-based proteomic assays, suffer from frequent missing values due to various experimental reasons. Since the missing data points can hinder downstream analyses, there exists a wide variety of ways in which to deal with missing values in large-scale data sets. Nowadays, it has become routine to estimate (or impute) the missing values prior to the actual data analysis. After nearly a decade since the publication of the first missing value imputation methods for gene expression microarray data, new imputation approaches are still being developed at an increasing rate. However, what is lagging behind is a systematic and objective evaluation of the strengths and weaknesses of the different approaches when faced with different types of data sets and experimental questions. In this review, the present strategies for missing value imputation and the measures for evaluating their performance are described. The imputation methods are first reviewed in the context of gene expression microarray data, since most of the methods have been developed for estimating gene expression levels; then, we turn to other large-scale data sets that also suffer from the problems posed by missing values, together with pointers to possible imputation approaches in these settings. Along with a description of the basic principles behind the different imputation approaches, the review tries to provide practical guidance for the users of high-throughput technologies on how to choose the imputation tool for their data and questions, and some additional research directions for the developers of imputation methodologies.
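To make the idea of imputing values and then measuring imputation performance concrete, here is a minimal sketch. It is not taken from the review: the abstract names no specific method, so nearest-neighbour averaging is assumed purely for illustration, and the function names `knn_impute` and `nrmse`, the synthetic data, and the 5% masking rate are all hypothetical. The evaluation hides known entries, imputes them, and compares the estimates to the hidden truth using a normalized root mean squared error (one common form of that measure).

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill NaNs in each row (e.g. a gene) with the mean of the k most similar rows.

    Illustrative assumption: similarity is the root mean squared difference
    over the columns that both rows have observed.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        obs_i = ~missing[i]
        dists = np.full(X.shape[0], np.inf)
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = obs_i & ~missing[j]
            if shared.any():
                dists[j] = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
        for c in np.where(missing[i])[0]:
            # Only rows with column c observed (and a finite distance) can donate a value.
            cand = np.where(~missing[:, c] & np.isfinite(dists))[0]
            if cand.size == 0:
                # Fallback: the row's own mean, or 0.0 if the row is entirely missing.
                filled[i, c] = X[i, obs_i].mean() if obs_i.any() else 0.0
                continue
            nearest = cand[np.argsort(dists[cand])[:k]]
            filled[i, c] = X[nearest, c].mean()
    return filled

def nrmse(truth, imputed, mask):
    """Normalized RMSE over the artificially hidden entries (lower is better)."""
    err = imputed[mask] - truth[mask]
    return np.sqrt(np.mean(err ** 2) / np.var(truth[mask]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic "expression" matrix: 60 rows x 10 columns, three shared profiles plus noise.
    profiles = rng.normal(size=(3, 10))
    truth = profiles[rng.integers(0, 3, size=60)] + 0.1 * rng.normal(size=(60, 10))
    mask = rng.random(truth.shape) < 0.05      # hide roughly 5% of entries
    observed = truth.copy()
    observed[mask] = np.nan

    # Baseline for comparison: replace each missing value with its row mean.
    row_means = np.nanmean(observed, axis=1, keepdims=True)
    baseline = np.where(np.isnan(observed), row_means, observed)

    print("NRMSE, row mean:", nrmse(truth, baseline, mask))
    print("NRMSE, KNN     :", nrmse(truth, knn_impute(observed, k=5), mask))
```

Scoring on artificially hidden entries only says how well values are recovered under this particular masking scheme. Whether that translates into better downstream analyses on real data is a separate question, and it is the kind of systematic evaluation the abstract describes as lagging behind.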