MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes.

Authors

Zhu Huaiqiu, Hu Gang-Qing, Yang Yi-Fan, Wang Jin, She Zhen-Su

Affiliation

State Key Lab for Turbulence and Complex Systems and Department of Biomedical Engineering, Peking University, Beijing 100871, China.

Publication

BMC Bioinformatics. 2007 Mar 16;8:97. doi: 10.1186/1471-2105-8-97.

DOI:10.1186/1471-2105-8-97

PMID:17367537

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1847833/

Abstract

BACKGROUND

Despite remarkable success in the computational prediction of genes in Bacteria and Archaea, the lack of a comprehensive understanding of prokaryotic gene structures hinders further elucidation of the differences among genomes. It remains worthwhile to develop new ab initio algorithms that not only predict genes accurately but also facilitate comparative studies of prokaryotic genomes.

RESULTS

This paper describes a new prokaryotic gene-finding algorithm based on a comprehensive statistical model of protein-coding Open Reading Frames (ORFs) and Translation Initiation Sites (TISs). The ORF model is based on a linguistic "Entropy Density Profile" (EDP) model of coding DNA sequence, and the TIS model comprises several features related to translation initiation. The two are combined into the Multivariate Entropy Distance (MED) algorithm, MED 2.0, which incorporates several strategies in an iterative program. The iterations provide a non-supervised learning process that yields a set of genome-specific parameters for the gene structure before genes are predicted.
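The abstract does not give the EDP formula. The sketch below follows the definition commonly attributed to the earlier MED work: for amino acid frequencies p_i with Shannon entropy H = -Σ p_i log p_i, the profile is s_i = -(p_i log p_i) / H. Treat that formula as an assumption here. The function names and the Euclidean distance are illustrative choices, not the MED 2.0 implementation.

```python
import math
from collections import Counter

# The 20 standard amino acids, in a fixed order so profiles are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def entropy_density_profile(protein: str) -> list[float]:
    """Return a 20-dimensional EDP vector for a translated ORF.

    Assumed definition: s_i = -(p_i * log p_i) / H, where p_i is the
    frequency of amino acid i and H is the Shannon entropy of the
    frequencies. The components sum to 1 by construction.
    """
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")

    freqs = [counts[aa] / total for aa in AMINO_ACIDS]
    # Each term is one amino acid's contribution to the entropy.
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one amino acid type is present, so the profile is undefined.
        raise ValueError("entropy is zero; EDP is undefined")
    return [t / entropy for t in terms]


def edp_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two EDP vectors (an illustrative choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Under these assumptions, a candidate ORF could be scored by comparing its EDP distance to a coding reference profile with its distance to a non-coding one. How MED 2.0 derives and updates such references during its iterations is not specified in the abstract.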

CONCLUSION

Extensive tests show that MED 2.0 achieves competitive performance in gene prediction for both 5' and 3' end matches compared with the current best prokaryotic gene finders. The advantage of MED 2.0 is particularly evident for GC-rich genomes and archaeal genomes. Furthermore, the genome-specific parameters produced by MED 2.0 agree with the current understanding of prokaryotic genomes and may serve as tools for comparative genomic studies. In particular, MED 2.0 reveals divergent translation initiation mechanisms in archaeal genomes while predicting TISs more accurately than existing gene finders and the current GenBank annotation.
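The abstract does not define "5' and 3' end matches". A minimal sketch under a common convention: a reference gene is matched at the 3' end when a prediction shares its strand and stop coordinate, and at the 5' end when that prediction also shares its start coordinate. The `Gene` record and the choice to report both rates over all reference genes are assumptions for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # coordinate of the translation start (5' end)
    stop: int    # coordinate of the stop codon (3' end)
    strand: str  # "+" or "-"


def end_match_rates(predicted: list[Gene], annotated: list[Gene]) -> tuple[float, float]:
    """Return (3' match rate, 5' match rate) over the annotated genes."""
    if not annotated:
        raise ValueError("annotated gene list is empty")

    # Index predictions by strand and stop so each reference gene is one lookup.
    by_stop = {(g.strand, g.stop): g for g in predicted}

    three_prime = 0
    five_prime = 0
    for ref in annotated:
        hit = by_stop.get((ref.strand, ref.stop))
        if hit is None:
            continue  # gene missed entirely
        three_prime += 1
        if hit.start == ref.start:
            five_prime += 1  # start (TIS) also agrees

    n = len(annotated)
    return three_prime / n, five_prime / n
```

Some studies report the 5' rate only over genes already matched at the 3' end, so the denominator should be stated whenever numbers from different tools are compared.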
