Roderic Guigó
Bioinformatics and Genomics, Center for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology (BIST), Dr. Aiguader 88, 08003 Barcelona, Catalonia.
Universitat Pompeu Fabra (UPF), Barcelona, Catalonia.
Cell Genom. 2023 Aug 1;3(8):100375. doi: 10.1016/j.xgen.2023.100375. eCollection 2023 Aug 9.
Within the next decade, the genomes of 1.8 million eukaryotic species will be sequenced. Identifying genes in these sequences is essential to understand the biology of the species. This is challenging due to the transcriptional complexity of eukaryotic genomes, which encode hundreds of thousands of transcripts of multiple types. Among these, a small set of protein-coding mRNAs plays a disproportionately large role in defining phenotypes. Due to their sequence conservation, orthology can be established, making it possible to define the universal catalog of eukaryotic protein-coding genes. This catalog should substantially contribute to uncovering the genomic events underlying the emergence of eukaryotic phenotypes. This piece briefly reviews the basics of protein-coding gene prediction, discusses challenges in finalizing annotation of the human genome, and proposes strategies for producing annotations across the eukaryotic Tree of Life. This lays the groundwork for obtaining the catalog of all genes, the Earth's code of life.