Missing genes in the annotation of prokaryotic genomes.

Affiliation

Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA.

Publication information

BMC Bioinformatics. 2010 Mar 15;11:131. doi: 10.1186/1471-2105-11-131.

Abstract

BACKGROUND

Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However, there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting them). Therefore, the question arises as to whether small genes are systematically missing from current genome annotations.

RESULTS

We have developed a high-performance computing methodology to investigate this problem. In this methodology, we compare all ORFs of at least 33 aa from all fully sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria that require a minimum taxonomic diversity among conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and have no strong similarity to gene sequences in public databases, which implies that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs that are readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of selected examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs.
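To make the first step concrete, the following is a minimal sketch of enumerating ORFs of at least 33 aa in all six reading frames of a replicon. It is an illustration, not the authors' pipeline. The abstract does not say how an ORF is delimited, so the sketch assumes a stop-to-stop definition with the three standard stop codons, and it ignores open stretches that run off the end of the sequence. The names `find_orfs`, `reverse_complement` and `MIN_ORF_AA` are hypothetical.

```python
# Assumption: ORFs are stop-to-stop stretches; the abstract does not define them.
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_ORF_AA = 33  # length threshold stated in the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=MIN_ORF_AA):
    """Yield (strand, frame, start, end) for each qualifying ORF.

    start and end are 0-based, end-exclusive offsets on the scanned
    strand, and the terminating stop codon is not included. Stretches
    that reach the end of the sequence without a stop are not reported.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    # Count codons between the previous stop and this one.
                    if (i - orf_start) // 3 >= min_aa:
                        yield (strand, frame, orf_start, i)
                    orf_start = i + 3


if __name__ == "__main__":
    # Toy sequence: 34 sense codons followed by a stop codon.
    toy = "ATG" + "GCT" * 33 + "TAA"
    for orf in find_orfs(toy):
        print(orf)
```

In a real analysis, the ORFs collected this way from every replicon would then be translated and compared against each other, which is the computationally expensive step the abstract refers to.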

CONCLUSIONS

Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
