White S H
Department of Physiology and Biophysics, University of California, Irvine 92717.
J Mol Evol. 1994 Apr;38(4):383-94. doi: 10.1007/BF00163155.
This paper continues an examination of the hypothesis that modern proteins evolved from random heteropeptide sequences. In support of the hypothesis, White and Jacobs (1993, J Mol Evol 36:79-95) have shown that any sequence chosen randomly from a large collection of nonhomologous proteins has a 90% or better chance of having a lengthwise distribution of amino acids that is indistinguishable from the random expectation regardless of amino acid type. The goal of the present study was to investigate the possibility that the random-origin hypothesis could explain the lengths of modern protein sequences without invoking specific mechanisms such as gene duplication or exon splicing. The sets of sequences examined were taken from the 1989 PIR database and consisted of 1,792 "super-family" proteins selected to have little sequence identity, 623 E. coli sequences, and 398 human sequences. The length distributions of the proteins could be described with high significance by either of two closely related probability density functions: the gamma distribution with parameter 2 or the distribution of the sum of two independent exponential random variables. A simple theory for the distributions was developed that assumes that (1) protoprotein sequences had independent, exponentially distributed random lengths, (2) the length dependence of protein stability determined which of these protoproteins could fold into compact primitive proteins and thereby attain the potential for biochemical activity, (3) the useful protein sequences were preserved by the primitive genome, and (4) the resulting distribution of sequence lengths is reflected by modern proteins. The theory successfully predicts the two observed distributions, which can be distinguished by the functional form of the dependence of protein stability on length. The theory leads to three interesting conclusions. First, it predicts that a tetra-nucleotide was the signal for primitive translation termination.
This prediction is entirely consistent with the observations of Brown et al. (1990a,b, Nucleic Acids Res 18:2079-2086 and 18:6339-6345), which show that tetra-nucleotides (stop codon plus following nucleotide) are the actual signals for termination of translation in both prokaryotes and eukaryotes. Second, the strong dependence of statistical length distributions on sequence-termination signaling codes implies that the evolution of stop codons and translation-termination processes was as important as gene splicing in early evolution. Third, because the theory is based upon a simple no-exon stochastic model, it provides a plausible alternative to a limited universe of exons from which all proteins evolved by gene duplication and exon splicing (Dorit et al. 1990, Science 250:1377-1382).
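
The two length distributions named in the abstract can be written down and compared numerically. The sketch below is illustrative only and is not the paper's fitting procedure. It reads "parameter 2" as the gamma shape parameter (an assumption), treats sequence length as a continuous variable, and uses a hypothetical scale of 150 residues that does not come from the paper. The function names are likewise invented for this example.

```python
import math
import random


def gamma2_pdf(x, scale):
    """Density of a gamma distribution with shape 2 and the given scale."""
    if x < 0:
        return 0.0
    return x / scale**2 * math.exp(-x / scale)


def sum_two_exp_pdf(x, rate1, rate2):
    """Density of X1 + X2 for independent exponentials with rates rate1, rate2."""
    if x < 0:
        return 0.0
    if math.isclose(rate1, rate2):
        # Equal rates: the sum is exactly gamma with shape 2, scale 1/rate.
        return gamma2_pdf(x, 1.0 / rate1)
    coeff = rate1 * rate2 / (rate2 - rate1)
    return coeff * (math.exp(-rate1 * x) - math.exp(-rate2 * x))


def simulate_lengths(n, rate1, rate2, seed=0):
    """Draw n lengths, each the sum of two independent exponential variables."""
    rng = random.Random(seed)
    return [rng.expovariate(rate1) + rng.expovariate(rate2) for _ in range(n)]


if __name__ == "__main__":
    scale = 150.0  # hypothetical scale in residues, NOT a value from the paper
    lengths = simulate_lengths(100_000, 1.0 / scale, 1.0 / scale)
    mean = sum(lengths) / len(lengths)
    print(f"simulated mean length: {mean:.1f} (theory: {2 * scale:.1f})")
    print(f"density at the mode x = scale: {gamma2_pdf(scale, scale):.6f}")
```

With equal rates the two densities coincide, and they differ only when the two exponential rates differ, which is one way to see why the abstract calls them closely related.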