Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America.
PLoS One. 2012;7(5):e36624. doi: 10.1371/journal.pone.0036624. Epub 2012 May 18.
Data on the number of Open Reading Frames (ORFs) coded by genomes from the three domains of life show some notable general features. These include essential differences between Prokaryotes and Eukaryotes: the number of ORFs grows linearly with total genome size for the former, but only logarithmically for the latter.
Simply by assuming that the (protein) coding and non-coding fractions of the genome must have different dynamics, and that the non-coding fraction must be particularly versatile and therefore be controlled by a variety of (unspecified) probability distribution functions (pdfs), we are able to predict that the number of ORFs for Eukaryotes follows a Benford distribution and must therefore have a specific logarithmic form. Using the data for the 1000+ genomes available to us in early 2010, we find that the Benford distribution provides excellent fits to the data over several orders of magnitude.
In its linear regime the Benford distribution produces excellent fits to the Prokaryote data, while the full non-linear form of the distribution similarly provides an excellent fit to the Eukaryote data. Furthermore, in their region of overlap the salient features are statistically congruent. This allows us to interpret the difference between Prokaryotes and Eukaryotes as the manifestation of the increased demand for the biological functions required by the larger Eukaryotes, to estimate some minimal genome sizes, and to predict a maximal Prokaryote genome size on the order of 8-12 megabasepairs. These results naturally allow a mathematical interpretation in terms of maximal entropy and, therefore, most efficient information transmission.
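The abstract does not state the functional form of the fit. As a minimal sketch only, the following assumes a Benford-type expression N(G) = a ln(1 + G/b), which is one simple function that is linear for small genome size G and logarithmic for large G. The function name `orf_count`, the symbols `a` and `b`, and all numerical values are hypothetical placeholders chosen for illustration; they are not the authors' fitted parameters or data.

```python
import math


def orf_count(genome_size, a, b):
    """Assumed Benford-type form: a * ln(1 + genome_size / b).

    This is an illustrative assumption, not the formula from the paper.
    """
    return a * math.log1p(genome_size / b)


# Placeholder parameters for illustration only (not fitted values).
A = 1000.0  # overall scale, in ORFs
B = 1.0e6   # crossover genome size, in base pairs

small = 1.0e3  # genome size much smaller than B: linear regime
large = 1.0e9  # genome size much larger than B: logarithmic regime

# Limiting forms of the assumed expression.
linear_approx = A * small / B          # a * G / b when G << b
log_approx = A * math.log(large / B)   # a * ln(G / b) when G >> b

print(orf_count(small, A, B), linear_approx)
print(orf_count(large, A, B), log_approx)
```

Under this assumption, a single two-parameter curve covers both behaviours described above: the linear growth seen for Prokaryotes corresponds to the small-G limit, and the logarithmic growth seen for Eukaryotes to the large-G limit.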