Hamid Jemila S, Hu Pingzhao, Roslin Nicole M, Ling Vicki, Greenwood Celia M T, Beyene Joseph
Biostatistics Methodology Unit, The Hospital for Sick Children Research Institute, 555 University Avenue, Toronto, ON, Canada M5G 1X8.
Hum Genomics Proteomics. 2009 Jan 12;2009:869093. doi: 10.4061/2009/869093.
Due to rapid technological advances, various types of genomic and proteomic data with different sizes, formats, and structures have become available. Among them are gene expression, single nucleotide polymorphism, copy number variation, and protein-protein/gene-gene interaction data. Each of these distinct data types provides a different, partly independent and complementary, view of the whole genome. However, understanding the functions of genes, proteins, and other aspects of the genome requires more information than any single dataset provides. Integrating data from different sources is, therefore, an important part of current research in genomics and proteomics. Data integration also plays an important role in combining clinical, environmental, and demographic data with high-throughput genomic data. Nevertheless, the concept of data integration is not well defined in the literature, and it may mean different things to different researchers. In this paper, we first propose a conceptual framework for integrating genetic, genomic, and proteomic data. The framework captures fundamental aspects of data integration and is built around the key steps in genetic, genomic, and proteomic data fusion. Second, we review some of the most commonly used current methods and approaches for combining genomic data, with a focus on the statistical aspects.