Computational evaluation of TIS annotation for prokaryotic genomes.

Authors

Hu Gang-Qing, Zheng Xiaobin, Ju Li-Ning, Zhu Huaiqiu, She Zhen-Su

Affiliation

State Key Lab for Turbulence and Complex System and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, China.

Publication

BMC Bioinformatics. 2008 Mar 25;9:160. doi: 10.1186/1471-2105-9-160.

DOI:10.1186/1471-2105-9-160

PMID:18366730

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2362131/

Abstract

BACKGROUND

Accurate annotation of translation initiation sites (TISs) is essential for understanding the translation initiation mechanism. However, the reliability of TIS annotation in widely used databases such as RefSeq is uncertain due to the lack of experimental benchmarks.

RESULTS

Based on a homogeneity assumption that gene translation-related signals are uniformly distributed across a genome, we have established a computational method for large-scale quantitative assessment of the reliability of TIS annotations for any prokaryotic genome. The method models the positional weight matrix (PWM) of aligned sequences around predicted TISs as a linear combination of three elementary PWMs, one for true TISs and two for false TISs. The three elementary PWMs are obtained from a reference set with highly reliable TIS predictions. A generalized least squares estimator determines the weight of the true-TIS component in the observed PWM, from which the accuracy of the prediction is derived. The validity of the method and the limitations imposed by its assumptions are explicitly addressed by testing on experimentally verified TISs, using reference sets of varying accuracy. The method is applied to estimate the accuracy of TIS annotations provided by public databases such as RefSeq and ProTISA and by programs such as EasyGene, GeneMarkS, Glimmer 3 and TiCo. RefSeq's TIS prediction is shown to be significantly less accurate than two recent predictors, TiCo and ProTISA. We present evidence for two general biases in the RefSeq annotation: over-annotating the longest open reading frame (LORF) and under-annotating the ATG start codon. Finally, we have established a new TIS database, SupTISA, based on the best prediction among all the predictors; SupTISA achieves an average accuracy of 92% over all 532 complete genomes.
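The mixture idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses ordinary least squares in place of the generalized least squares estimator named in the abstract, and the function names (`pwm`, `mixture_weights`), the sum-to-one constraint on the weights, and the synthetic matrices are assumptions made purely for illustration.

```python
import numpy as np

ALPHABET = "ACGT"

def pwm(seqs):
    """Column-wise nucleotide frequencies of equal-length aligned sequences."""
    counts = np.zeros((len(seqs[0]), 4))
    for s in seqs:
        for i, ch in enumerate(s):
            counts[i, ALPHABET.index(ch)] += 1
    return counts / len(seqs)

def mixture_weights(observed, true_pwm, false_a, false_b):
    """Least-squares weights (w_true, w_a, w_b), constrained to sum to 1.

    Assumption: plain OLS stands in for the paper's generalized estimator.
    """
    # Substitute w_b = 1 - w_true - w_a, then solve for the two free weights.
    y = (observed - false_b).ravel()
    X = np.column_stack([(true_pwm - false_b).ravel(),
                         (false_a - false_b).ravel()])
    (w_true, w_a), *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_true, w_a, 1.0 - w_true - w_a

# Synthetic elementary PWMs (hypothetical data): 20 positions x 4 nucleotides.
rng = np.random.default_rng(0)
true_pwm, false_a, false_b = (rng.dirichlet(np.ones(4), size=20) for _ in range(3))

# An "observed" PWM built as a known mixture of the three components.
observed = 0.8 * true_pwm + 0.15 * false_a + 0.05 * false_b
w_true, w_a, w_b = mixture_weights(observed, true_pwm, false_a, false_b)
print(round(w_true, 3), round(w_a, 3), round(w_b, 3))
```

On this noise-free synthetic mixture the fit recovers weights close to the 0.8 / 0.15 / 0.05 used to build it. In this sketch the weight of the true-TIS component is read as the estimated fraction of correct predictions; how the paper converts that weight into an accuracy figure is not specified in the abstract.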

CONCLUSION

Large-scale computational evaluation of TIS annotation has been achieved. A new TIS database much better than RefSeq has been constructed, and it provides a valuable resource for further TIS studies.
