Levin Michal, Butter Falk
Institute of Molecular Biology (IMB), 55128 Mainz, Germany.
Comput Struct Biotechnol J. 2022 Jul 9;20:3667-3675. doi: 10.1016/j.csbj.2022.07.007. eCollection 2022.
Applications in omics research, such as comparative transcriptomics and proteomics, require knowledge of the species-specific gene sequences and benefit from a comprehensive, high-quality annotation of the coding genes to achieve high coverage. While protein-coding genes can in simple cases be detected by scanning the genome for open reading frames, in more complex genomes exonic sequences are separated by introns. Despite advances in sequencing technologies that allow ever-growing numbers of genomes to be sequenced, many of the available genome assemblies do not reach reference quality. These non-contiguous assemblies with gaps, together with the need to predict splice sites, limit accurate gene annotation from genomic data alone. In contrast, the transcriptome contains only transcribed gene regions and is devoid of introns, and thus provides the optimal basis for the identification of open reading frames. The additional integration of proteomics data to validate predicted protein-coding genes further enriches for accurate gene models. This review outlines the principles of the proteotranscriptomics approach, discusses common challenges and suggests methods for improvement.
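To make the notion of scanning an intron-free sequence for open reading frames concrete, the following is a minimal illustrative sketch, not the method of any tool discussed in the review. It assumes the standard stop codons and an ATG start codon, ignores ambiguous bases, and the names `find_orfs`, `reverse_complement` and `min_codons` are hypothetical. Real pipelines rely on dedicated ORF predictors applied to assembled transcripts.

```python
# Illustrative ORF scan over one transcript sequence (assumption: standard
# genetic code, ATG start codon, uppercase/lowercase A/C/G/T only).
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for all six reading frames.

    Coordinates are 0-based and end-exclusive on the scanned strand.
    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon; min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # close it and keep scanning this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

In a proteotranscriptomics setting, the candidate ORFs from such a scan would be translated and the resulting protein sequences used as a search database for mass spectrometry data, so that only ORFs supported by identified peptides are retained.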