US2B, UMR 6286 of CNRS, Nantes University, rue de la Houssinière, 44322, Nantes, France.
J Mol Evol. 2023 Aug;91(4):492-501. doi: 10.1007/s00239-023-10116-1. Epub 2023 May 23.
To study unknown proteins on a large scale, a reference system was set up for the three best-studied eukaryotic kingdoms, built from 36 proteomes chosen to be as taxonomically diverse as possible. Proteins from 362 other eukaryotic proteomes with no known homologue in this set were then analyzed, focusing in particular on singletons, that is, on those proteins that also have no known homologue in their own proteome. Consistently, for a given species, no more than 12% of the singletons thus found are known at the protein level, according to UniProt. In addition, since AlphaFold2 relies on the information found in the alignment of homologous sequences, its predictions of their three-dimensional structure are poor. In metazoan species, the number of singletons rarely exceeds 1000 for the species closest to the reference system (divergence times below 75 Myr). Interestingly, in Viridiplantae and fungi, larger numbers of singletons are found for such species, as if the timescale on which singletons are added to proteomes differed between metazoa and the other eukaryotic kingdoms. Confirming this phenomenon will, however, require further studies of proteomes closer to those of the reference system.
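The singleton definition used above can be illustrated with a minimal sketch. It assumes that homology relationships have already been computed by some sequence-search step and are available as a lookup table. The function name `find_singletons`, the `homologues` mapping and the protein identifiers are hypothetical and chosen for illustration only; they are not taken from the study.

```python
from typing import Dict, Set


def find_singletons(
    proteome: Set[str],
    homologues: Dict[str, Set[str]],
    reference: Set[str],
) -> Set[str]:
    """Return proteins with no known homologue in the reference set
    and no known homologue in their own proteome.

    `homologues` is an assumed, precomputed mapping from a protein
    identifier to the identifiers of its known homologues.
    """
    singletons = set()
    for protein in proteome:
        # Ignore self-hits, which any sequence search would report.
        hits = homologues.get(protein, set()) - {protein}
        if hits & reference:
            continue  # has a homologue in the reference system
        if hits & proteome:
            continue  # has a homologue in its own proteome
        singletons.add(protein)
    return singletons


# Toy data: p1 matches the reference set, p2 and p3 match each other,
# and p4 has no known homologue at all.
toy_proteome = {"p1", "p2", "p3", "p4"}
toy_reference = {"r1"}
toy_homologues = {"p1": {"r1"}, "p2": {"p3"}, "p3": {"p2"}}
toy_singletons = find_singletons(toy_proteome, toy_homologues, toy_reference)
```

In this toy case only `p4` qualifies as a singleton. How homology is actually detected (search tool, thresholds) is left outside the sketch, since the abstract does not specify it.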