Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France.
BMC Bioinformatics. 2020 Nov 10;21(1):513. doi: 10.1186/s12859-020-03855-1.
Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon-intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.
We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, we detected a total of 82,305 potential errors: 44,001 deletions, 27,289 insertions and 11,015 mismatched segments, in which part of the correct protein sequence is replaced by an alternative, erroneous sequence. We then focused on the mismatched segment errors, which cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes of the gene misprediction in approximately half (5,446) of these cases. As a proof of concept, we also developed a simple method that allowed us to propose improved sequences for 603 primate proteins.
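The three error classes can be illustrated with a minimal sketch that scans a pairwise alignment of a reference (human) protein against a predicted primate protein. This is not the authors' pipeline: the function names, the gapped-string input format, and the `min_len` threshold are all assumptions made for illustration, and the example sequences are invented.

```python
from itertools import groupby


def column_state(ref_char, query_char):
    """Classify one alignment column relative to the reference sequence."""
    if ref_char == "-" and query_char == "-":
        return "skip"
    if query_char == "-":
        return "deletion"   # residues present in the reference, absent in the query
    if ref_char == "-":
        return "insertion"  # residues present in the query, absent in the reference
    if ref_char != query_char:
        return "mismatch"
    return "match"


def classify_segments(ref_aln, query_aln, min_len=3):
    """Return (kind, start, end) runs of deletions, insertions and mismatches.

    Both inputs are gapped strings of equal length. min_len is a hypothetical
    threshold: shorter runs are ignored here because isolated differences are
    more likely to be genuine substitutions between species.
    Coordinates are alignment columns, end-exclusive.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    states = [column_state(r, q) for r, q in zip(ref_aln, query_aln)]
    segments = []
    pos = 0
    for state, run in groupby(states):
        length = len(list(run))
        if state in ("deletion", "insertion", "mismatch") and length >= min_len:
            segments.append((state, pos, pos + length))
        pos += length
    return segments


# Invented toy alignment, for illustration only.
ref = "MKTAYIAKQRQISFVKSHFSRQ---LEE"
query = "MKTAY----RQISFVWWWWSRQPPPLEE"
print(classify_segments(ref, query))
# [('deletion', 5, 9), ('mismatch', 15, 19), ('insertion', 22, 25)]
```

A real analysis would start from a multiple alignment of orthologous proteins and would need a more careful criterion than a fixed run length to separate prediction errors from true sequence divergence.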
Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon-intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Future directions include characterizing other types of gene prediction errors and developing a more comprehensive algorithm for protein sequence error correction.