Cohen Barry, Skiena Steven
Department of Computer Science, University Heights, New Jersey Institute of Technology, Newark, NJ 07102, USA.
J Comput Biol. 2003;10(3-4):419-32. doi: 10.1089/10665270360688101.
Messenger RNA (mRNA) sequences serve as templates for proteins according to the triplet code, in which each of the 4(3) = 64 different codons (sequences of three consecutive nucleotide bases) in RNA either terminate transcription or map to one of the 20 different amino acids (or residues) which build up proteins. Because there are more codons than residues, there is inherent redundancy in the coding. Certain residues (e.g., tryptophan) have only a single corresponding codon, while other residues (e.g., arginine) have as many as six corresponding codons. This freedom implies that the number of possible RNA sequences coding for a given protein grows exponentially in the length of the protein. Thus nature has wide latitude to select among mRNA sequences which are informationally equivalent, but structurally and energetically divergent. In this paper, we explore how nature takes advantage of this freedom and how to algorithmically design structures more energetically favorable than have been built through natural selection. In particular: (1) Natural Selection--we perform the first large-scale computational experiment comparing the stability of mRNA sequences from a variety of organisms to random synonymous sequences which respect the codon preferences of the organism. This experiment was conducted on over 27,000 sequences from 34 microbial species with 36 genomic structures. We provide evidence that in all genomic structures highly stable sequences are disproportionately abundant, and in 19 of 36 cases highly unstable sequences are disproportionately abundant. This suggests that the stability of mRNA sequences is subject to natural selection. (2) Artificial Selection--motivated by these biological results, we examine the algorithmic problem of designing the most stable and unstable mRNA sequences which code for a target protein. We give a polynomial-time dynamic programming solution to the most stable sequence problem (MSSP), which is asymptotically no more complex than secondary structure prediction. We show that the corresponding least stable sequence problem (LSSP) is NP-complete, and develop two heuristics for the construction of such sequences. We have implemented these algorithms, and present experimental results placing the high/low stability sequences in context with both wildtype and random encodings. Our implementation has already been applied to the design of RNA "code-words" creating little or no secondary structure in RNA computing (Brenneman and Condon, 2001; Marathe et al., 2001), and we anticipate a variety of other applications of this work to sequence design problems (Skiena, 2001).
信使核糖核酸(mRNA)序列根据三联体密码作为蛋白质的模板,其中RNA中4³ = 64种不同密码子(三个连续核苷酸碱基的序列)中的每一个要么终止转录,要么映射到构成蛋白质的20种不同氨基酸(或残基)之一。由于密码子比残基多,编码中存在固有的冗余。某些残基(例如色氨酸)只有一个相应的密码子,而其他残基(例如精氨酸)有多达六个相应的密码子。这种自由度意味着编码给定蛋白质的可能RNA序列的数量随蛋白质长度呈指数增长。因此,自然界有很大的自由度在信息上等效但结构和能量上不同的mRNA序列中进行选择。在本文中,我们探讨了自然界如何利用这种自由度,以及如何通过算法设计出比通过自然选择构建的结构在能量上更有利的结构。具体而言:(1)自然选择——我们进行了首次大规模计算实验,比较了来自各种生物体的mRNA序列与遵循生物体密码子偏好的随机同义序列的稳定性。该实验对来自34种微生物物种的27000多个序列和36种基因组结构进行了研究。我们提供的证据表明,在所有基因组结构中,高度稳定的序列不成比例地丰富,在36种情况中的19种中,高度不稳定的序列不成比例地丰富。这表明mRNA序列的稳定性受到自然选择的影响。(2)人工选择——受这些生物学结果的启发,我们研究了设计编码目标蛋白质的最稳定和最不稳定mRNA序列的算法问题。我们给出了最稳定序列问题(MSSP)的多项式时间动态规划解决方案,其渐近复杂度不超过二级结构预测。我们表明相应的最不稳定序列问题(LSSP)是NP完全问题,并开发了两种构建此类序列的启发式方法。我们已经实现了这些算法,并展示了将高/低稳定性序列与野生型和随机编码相关联的实验结果。我们的实现已经应用于RNA“码字”的设计,这些“码字”在RNA计算中几乎不产生或不产生二级结构(Brenneman和Condon,2001;Marathe等人,2001),并且我们预计这项工作在序列设计问题上还有各种其他应用(Skiena,2001)。