Reduced Amino Acid Substitution Matrices Find Traces of Ancient Coding Alphabets in Modern Day Proteins.

Author information

Douglas Jordan, Bouckaert Remco, Carter Charles W, Wills Peter R

Affiliations

Department of Physics, The University of Auckland, Auckland, New Zealand.

Centre for Computational Evolution, The University of Auckland, Auckland, New Zealand.

Publication information

Mol Biol Evol. 2025 Sep 1;42(9). doi: 10.1093/molbev/msaf197.

DOI:10.1093/molbev/msaf197

PMID:40796178

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12402984/

Abstract

All known living systems make proteins from the same 20 canonically coded amino acids, but this was not always the case. Early genetic coding systems likely operated with a restricted pool of amino acid types and limited means to distinguish between them. Despite this, amino acid substitution models like LG and WAG all assume a constant coding alphabet over time. That makes them especially inappropriate for the aminoacyl-tRNA synthetases (aaRS), the enzymes that govern translation. To address this limitation, we created a class of substitution models that account for evolutionary changes in the coding alphabet size by defining the transition from 19 states in a past epoch to 20 now. We use a Bayesian phylogenetic framework to improve phylogeny estimation and testing of this two-alphabet hypothesis. The hypothesis was strongly rejected by datasets composed exclusively of "young" eukaryotic proteins. It was generally supported by "old" (aaRS and non-aaRS) proteins whose origins date from before the last universal common ancestor. Standard methods overestimate the divergence ages of proteins that originated under reduced coding alphabets in both simulated and aaRS alignments. The new model provides a timeline slightly more consistent with the Earth's history. Our findings suggest that aaRS functional bifurcation events can explain much of the genetic code's evolution, but there remain other unknown forces at play too. This work provides a robust, seamless framework for reconstructing phylogenies from ancient protein datasets and offers further insights into the dawn of molecular biology.
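As a rough illustration of what a transition "from 19 states in a past epoch to 20 now" could look like numerically, the sketch below composes two equal-rates substitution processes across an epoch boundary. It is a toy under stated assumptions, not the model described in the abstract: the equal-rates matrices stand in for empirical ones such as LG or WAG, the merged pair `("D", "E")` and the even `split_prob` are arbitrary placeholders, and all function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical one-letter codes


def equal_rates_matrix(k, t):
    """Transition probabilities of an equal-rates k-state model after time t.

    Rates are scaled so that t is the expected number of substitutions per site.
    This is a stand-in for an empirical matrix, chosen only for simplicity.
    """
    decay = math.exp(-k * t / (k - 1))
    same = 1.0 / k + (k - 1) / k * decay
    diff = 1.0 / k - decay / k
    return [[same if i == j else diff for j in range(k)] for i in range(k)]


def split_matrix(merged_pair, split_prob=0.5):
    """19 x 20 matrix mapping reduced-alphabet states onto the full alphabet.

    ASSUMPTION: the two amino acids in merged_pair were indistinguishable in
    the old epoch, and the merged state resolves into the first with
    probability split_prob at the epoch boundary.
    """
    a, b = merged_pair
    reduced = [x for x in AMINO_ACIDS if x not in merged_pair] + [a + b]
    rows = []
    for state in reduced:
        if state == a + b:
            row = [split_prob if x == a else (1.0 - split_prob) if x == b else 0.0
                   for x in AMINO_ACIDS]
        else:
            row = [1.0 if x == state else 0.0 for x in AMINO_ACIDS]
        rows.append(row)
    return reduced, rows


def matmul(left, right):
    """Plain-Python matrix product, to keep the sketch dependency-free."""
    return [[sum(left[i][k] * right[k][j] for k in range(len(right)))
             for j in range(len(right[0]))]
            for i in range(len(left))]


def two_epoch_matrix(t_old, t_new, merged_pair=("D", "E")):
    """P(old 19-state residue -> modern 20-state residue) along one branch.

    t_old is the branch length spent before the alphabet expanded and t_new
    the branch length after it; the pair ("D", "E") is a hypothetical choice.
    """
    reduced, split = split_matrix(merged_pair)
    p_old = equal_rates_matrix(len(reduced), t_old)
    p_new = equal_rates_matrix(len(AMINO_ACIDS), t_new)
    return reduced, matmul(matmul(p_old, split), p_new)


reduced_states, branch_probs = two_epoch_matrix(t_old=0.3, t_new=0.7)
```

Each row of `branch_probs` is a probability distribution over the 20 modern amino acids given one of the 19 ancestral states, so a branch that crosses the boundary is handled by a single rectangular matrix instead of a square one.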
