Extrapolating ENCODE data to the whole human genome.

Author information

Costantini Maria, Di Filippo Miriam, Bernardi Giorgio

Affiliation

Laboratory of Molecular Evolution, Stazione Zoologica Anton Dohrn, 80121 Naples, Italy.

Publication information

Gene. 2008 Aug 1;419(1-2):66-9. doi: 10.1016/j.gene.2008.02.013. Epub 2008 Feb 21.

DOI:10.1016/j.gene.2008.02.013

PMID:18378099

Abstract

The ENCODE (ENCyclopedia Of DNA Elements) project was launched three years ago with the purpose of identifying all of the functional elements in the human genome. ENCODE was started with 44 target sequences, which comprise 1% of the human genome. A crucial question about ENCODE is how representative it is of the human genome. Indeed, this is not a negligible problem if one considers that only 1% of the genome was selected for the project, and, more importantly, that the choice of the large DNA segments was based on two major criteria, namely the presence of extensively characterized genes and/or other functional elements, and the availability of a substantial amount of comparative sequence data. We found that the ENCODE data lead to an unbalanced representation of the compositional pattern of the human genome, especially for the GC-poorest and GC-richest regions. This unbalanced representativity of ENCODE can, however, be corrected by multiplying ENCODE data by a G/E factor (the ratio of whole genome data over ENCODE data), so amplifying the potential interest of ENCODE.
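
The G/E correction described in the abstract can be illustrated with a minimal sketch. It assumes the factor is computed separately for each compositional (GC-level) class as the amount of whole-genome DNA in that class divided by the amount of ENCODE DNA in the same class, and that an ENCODE-derived quantity is then multiplied by the factor of its class. The class names (`gc_poor`, `gc_mid`, `gc_rich`), the function names, and all numbers below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def ge_factors(genome_amounts, encode_amounts):
    """Return the G/E factor (genome amount / ENCODE amount) for each class."""
    factors = {}
    for cls, genome_value in genome_amounts.items():
        encode_value = encode_amounts.get(cls, 0.0)
        if encode_value <= 0:
            # A class absent from ENCODE cannot be extrapolated by a ratio.
            raise ValueError(f"no ENCODE data for class {cls!r}")
        factors[cls] = genome_value / encode_value
    return factors


def extrapolate(encode_values, factors):
    """Scale each ENCODE-derived value by the G/E factor of its class."""
    return {cls: value * factors[cls] for cls, value in encode_values.items()}


# Hypothetical amounts of DNA (in Mb) per GC class; NOT data from the paper.
genome_mb = {"gc_poor": 1200.0, "gc_mid": 1500.0, "gc_rich": 300.0}
encode_mb = {"gc_poor": 6.0, "gc_mid": 15.0, "gc_rich": 9.0}

factors = ge_factors(genome_mb, encode_mb)

# Hypothetical counts of some annotated feature found in the ENCODE regions.
encode_counts = {"gc_poor": 10, "gc_mid": 60, "gc_rich": 90}
genome_estimate = extrapolate(encode_counts, factors)

for cls in genome_mb:
    print(cls, round(factors[cls], 2), round(genome_estimate[cls]))
```

In this made-up example the GC-rich class is over-represented in ENCODE relative to the genome, so it receives a smaller factor than the GC-poor class, which is the kind of rebalancing the abstract describes.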
