Jansen Ronald, Lan Ning, Qian Jiang, Gerstein Mark
Department of Molecular Biophysics & Biochemistry, 266 Whitney Avenue, Yale University, PO Box 208114, New Haven, CT 06520, USA.
J Struct Funct Genomics. 2002;2(2):71-81. doi: 10.1023/a:1020495201615.
The ultimate goal of functional genomics is to define the function of all the genes in the genome of an organism. A large body of information on the biological roles of genes has accumulated over the past decades of research, both from traditional experiments detailing the role of individual genes and proteins and from newer experimental strategies that aim to characterize gene function on a genomic scale. The goal of functional genomics can only be achieved by integrating information from these different kinds of experiments, so data integration is an important challenge for bioinformatics. Integrating different data sources often helps to uncover non-obvious relationships between genes, and it has two further benefits. First, when information from multiple independent sources agrees, it is likely to be more reliable. Second, by taking the union of multiple sources, one can cover larger parts of the genome. This is obvious when integrating results from multiple single-gene or single-protein experiments, but it is also necessary for many genome-wide experiments, since their results are often confined to certain (although sizable) subsets of the genome. In this paper, we explore an example of such a data integration procedure. We focus on predicting membership in protein complexes for individual genes. For this, we recruit six different data sources that include expression profiles, interaction data, essentiality and localization information. Each of these data sources individually contains only weakly predictive information about protein complexes, but we show how the prediction can be improved by combining all of them. Supplementary information is available at http://bioinfo.mbb.yale.edu/integrate/interactions/.
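The two benefits of integration named in the abstract can be illustrated with a small Python sketch. It is not taken from the paper: it assumes the sources are conditionally independent, so their likelihood ratios multiply (a naive Bayes style combination), and the function names, the prior, and the likelihood-ratio values are hypothetical choices made only for illustration.

```python
from math import prod


def combined_posterior(prior, likelihood_ratios):
    """Posterior probability after combining independent pieces of evidence.

    prior: assumed prior probability that a gene belongs to a complex.
    likelihood_ratios: one value per data source that covers the gene,
        P(observation | member) / P(observation | non-member).
    Assumption (not from the paper): sources are conditionally independent,
    so their likelihood ratios multiply.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


def coverage(sources, genome_size):
    """Fraction of the genome covered by the union of all sources.

    sources: dict mapping a source name to the set of gene IDs it covers.
    """
    covered = set().union(*sources.values())
    return len(covered) / genome_size


if __name__ == "__main__":
    # Hypothetical numbers: three weak sources that all agree.
    prior = 0.01
    print("one source: ", combined_posterior(prior, [5.0]))
    print("all sources:", combined_posterior(prior, [5.0, 4.0, 3.0]))

    # Hypothetical coverage: each source sees only part of a 10-gene genome.
    sources = {"expression": {1, 2, 3, 4}, "interaction": {3, 4, 5, 6, 7}}
    print("coverage:   ", coverage(sources, genome_size=10))
```

Under these assumptions, agreeing weak sources raise the posterior well above what any single source gives, and the union covers more genes than either source alone. If the sources are correlated, multiplying likelihood ratios overstates the confidence, so the independence assumption should be checked before relying on such a combination.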