Distinctive sequence features in protein coding, genic non-coding, and intergenic human DNA.

Author information

Guigó R, Fickett J W

Affiliation

Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, NM 87545, USA.

Publication information

J Mol Biol. 1995 Oct 13;253(1):51-60. doi: 10.1006/jmbi.1995.0535.

PMID:7473716

Abstract

We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the randomly selected clone sequences (mostly intergenic DNA) and in the genic sequences. Strong differences in behavior of a number of such statistics are also observed, however, when the randomly selected clone sequences are compared with only the non-coding fraction of the genic sequences, suggesting that intergenic and genic non-coding DNA constitute two different classes of non-coding DNA. By studying the behavior of the sequence statistics in simulated DNA of different C+G content, we have observed that a number of them are strongly dependent on C+G content. Thus, most differences between intergenic and genic non-coding DNA can be explained by differences in C+G content. A+T-rich intergenic DNA appears to be at the compositional equilibrium expected under random mutation, while C+G richer non-coding genic DNA is far from this equilibrium. The results obtained in simulated DNA indicate, on the other hand, that a very large fraction of the variation in the coding statistics that underlie gene identification algorithms is due simply to C+G content, and is not directly related to protein coding function. It appears, thus, that the performance of gene-finding algorithms should be improved by carefully distinguishing the effects of protein coding function from those of mere base compositional variation on such coding statistics.
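
The dependence of a coding-related statistic on C+G content can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes a simple model in which bases are drawn independently, with G and C equally likely and A and T equally likely, and it uses the in-frame stop codon fraction as a stand-in statistic chosen for illustration only. The function names (`simulate_dna`, `stop_codon_fraction`, `expected_stop_fraction`) are hypothetical.

```python
import random

# Stop codons of the standard genetic code; all three are A+T-rich.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def simulate_dna(length, gc_content, rng):
    """Draw independent bases with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2.

    This independence model is an assumption made for illustration.
    """
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # Weights are given in the order A, C, G, T.
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def stop_codon_fraction(seq):
    """Fraction of codons in reading frame 0 that are stop codons."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(codon in STOP_CODONS for codon in codons) / len(codons)


def expected_stop_fraction(gc_content):
    """Analytic stop codon probability under the same independence model."""
    a = t = (1.0 - gc_content) / 2.0
    g = gc_content / 2.0
    # P(TAA) + P(TAG) + P(TGA)
    return t * a * a + t * a * g + t * g * a


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
    for gc_content in (0.3, 0.4, 0.5, 0.6, 0.7):
        seq = simulate_dna(300_000, gc_content, rng)
        print(gc_content, stop_codon_fraction(seq), expected_stop_fraction(gc_content))
```

Under this model the stop codon fraction falls as C+G content rises, so open reading frames lengthen in C+G-rich sequence even though nothing in the simulated DNA encodes protein. This is one simple way a statistic used in gene finding can vary with base composition alone.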
