Combining diverse evidence for gene recognition in completely sequenced bacterial genomes.

Authors

Frishman D, Mironov A, Mewes H W, Gelfand M

Affiliation

Munich Information Center for Protein Sequences (MIPS) of the German National Center for Health and Environment (GSF), Am Klopferspitz 18a, 82152 Martinsried, Germany.

Publication information

Nucleic Acids Res. 1998 Jun 15;26(12):2941-7. doi: 10.1093/nar/26.12.2941.

DOI:10.1093/nar/26.12.2941

PMID:9611239

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC147632/

Abstract

Analysis of a newly sequenced bacterial genome starts with identification of protein-coding genes. Functional assignment of proteins requires exact knowledge of protein N-termini. We present a new program, ORPHEUS, that identifies candidate genes and accurately predicts gene starts. The analysis starts with a database similarity search and identification of reliable gene fragments. The latter are used to derive statistical characteristics of protein-coding regions and ribosome-binding sites and to predict the complete set of genes in the analyzed genome. In a test on the Bacillus subtilis and Escherichia coli genomes, the program correctly identified 93.3% (resp. 96.3%) of experimentally annotated genes longer than 100 codons described in the PIR-International database, and for these genes 96.3% (83.9%) of starts were predicted exactly. Furthermore, 98.9% (99.1%) of genes longer than 100 codons annotated in GenBank were found, and 92.9% (75.7%) of predicted starts coincided with the feature table description. Finally, for the complete gene complements of B. subtilis and E. coli, including genes shorter than 100 codons, gene prediction accuracy was 88.9% and 87.1%, respectively, with 94.2% and 76.7% of starts coinciding with the existing annotation.
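The abstract reports two kinds of figures: the share of annotated genes that were found, and, among those found genes, the share whose start was predicted exactly. The minimal Python sketch below shows one way such figures could be computed. It is not the ORPHEUS evaluation code: the `Gene` record, the `evaluate` function, and the matching rule (a gene counts as found when a prediction has the same strand and stop coordinate) are assumptions made for illustration only.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record used only for this illustration."""
    start: int   # coordinate of the start codon
    stop: int    # coordinate of the stop codon
    strand: str  # "+" or "-"


def evaluate(annotated, predicted):
    """Return (fraction of annotated genes found,
               fraction of found genes whose start matches exactly)."""
    # Assumption: a gene is "found" if some prediction shares its strand
    # and stop coordinate; the abstract does not state the matching rule.
    by_stop = {(g.strand, g.stop): g for g in predicted}
    found = [
        (a, by_stop[(a.strand, a.stop)])
        for a in annotated
        if (a.strand, a.stop) in by_stop
    ]
    exact = sum(1 for a, p in found if a.start == p.start)
    gene_rate = len(found) / len(annotated) if annotated else 0.0
    start_rate = exact / len(found) if found else 0.0
    return gene_rate, start_rate


# Toy data, invented for the example (not taken from any genome).
annotated = [
    Gene(100, 400, "+"),
    Gene(500, 950, "+"),
    Gene(1800, 1200, "-"),
    Gene(2000, 2300, "+"),
]
predicted = [
    Gene(100, 400, "+"),    # found, exact start
    Gene(530, 950, "+"),    # found, start differs
    Gene(1800, 1200, "-"),  # found, exact start
]

gene_rate, start_rate = evaluate(annotated, predicted)
print(f"genes found: {gene_rate:.1%}, exact starts: {start_rate:.1%}")
```

Computing the start rate over found genes only mirrors the abstract's wording ("for these genes ... of starts were predicted exactly"); a different matching rule would give different numbers.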


Similar articles

GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions.

Nucleic Acids Res. 2001 Jun 15;29(12):2607-18. doi: 10.1093/nar/29.12.2607.

Finding prokaryotic genes by the 'frame-by-frame' algorithm: targeting gene starts and overlapping genes.

Bioinformatics. 1999 Nov;15(11):874-86. doi: 10.1093/bioinformatics/15.11.874.

GeneLook: a novel ab initio gene identification system suitable for automated annotation of prokaryotic sequences.

Gene. 2005 Feb 14;346:115-25. doi: 10.1016/j.gene.2004.10.018. Epub 2005 Jan 26.

Reannotation of Shewanella oneidensis genome.

OMICS. 2003 Summer;7(2):171-5. doi: 10.1089/153623103322246566.

An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics.

Genome Res. 2017 Dec;27(12):2083-2095. doi: 10.1101/gr.218255.116. Epub 2017 Nov 15.

Accuracy improvement for identifying translation initiation sites in microbial genomes.

Bioinformatics. 2004 Dec 12;20(18):3308-17. doi: 10.1093/bioinformatics/bth390. Epub 2004 Jul 9.

Bacterial start site prediction.

Nucleic Acids Res. 1999 Sep 1;27(17):3577-82. doi: 10.1093/nar/27.17.3577.

Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes.

BMC Bioinformatics. 2002;3:5. doi: 10.1186/1471-2105-3-5. Epub 2002 Feb 5.

[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].

Yi Chuan Xue Bao. 2004 May;31(5):431-43.

Cited by

Analysis of metagenomic data.

Nat Rev Methods Primers. 2025;5. doi: 10.1038/s43586-024-00376-6. Epub 2025 Jan 23.

Gene Prediction in Metagenomic Fragments with Deep Learning.

Biomed Res Int. 2017;2017:4740354. doi: 10.1155/2017/4740354. Epub 2017 Nov 8.

A Primer on Infectious Disease Bacterial Genomics.

Clin Microbiol Rev. 2016 Oct;29(4):881-913. doi: 10.1128/CMR.00001-16. Epub 2016 Sep 7.

Complete Genome Sequence of Xanthomonas campestris pv. campestris Strain 17 from Taiwan.

Genome Announc. 2015 Dec 17;3(6):e01466-15. doi: 10.1128/genomeA.01466-15.

SearchDOGS bacteria, software that provides automated identification of potentially missed genes in annotated bacterial genomes.

J Bacteriol. 2014 Jun;196(11):2030-42. doi: 10.1128/JB.01368-13. Epub 2014 Mar 21.

Jaccard index based similarity measure to compare transcription factor binding site models.

Algorithms Mol Biol. 2013 Sep 30;8(1):23. doi: 10.1186/1748-7188-8-23.

Gene prediction in metagenomic fragments based on the SVM algorithm.

BMC Bioinformatics. 2013;14 Suppl 5(Suppl 5):S12. doi: 10.1186/1471-2105-14-S5-S12. Epub 2013 Apr 10.

How many signal peptides are there in bacteria?

Environ Microbiol. 2013 Apr;15(4):983-90. doi: 10.1111/1462-2920.12105.

The transcriptional landscape of the deep-sea bacterium Photobacterium profundum in both a toxR mutant and its parental strain.

BMC Genomics. 2012 Oct 29;13:567. doi: 10.1186/1471-2164-13-567.

A small predatory core genome in the divergent marine Bacteriovorax marinus SJ and the terrestrial Bdellovibrio bacteriovorus.

ISME J. 2013 Jan;7(1):148-60. doi: 10.1038/ismej.2012.90. Epub 2012 Sep 6.

References

Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction.

Pac Symp Biocomput. 1998:279-90.

Fast comparison of a DNA sequence with a protein sequence database.

Microb Comp Genomics. 1996;1(4):281-91. doi: 10.1089/mcg.1996.1.281.

GeneMark.hmm: new solutions for gene finding.

Nucleic Acids Res. 1998 Feb 15;26(4):1107-15. doi: 10.1093/nar/26.4.1107.

Microbial gene identification using interpolated Markov models.

Nucleic Acids Res. 1998 Jan 15;26(2):544-8. doi: 10.1093/nar/26.2.544.

Comparison of DNA sequences with protein sequences.

Genomics. 1997 Nov 15;46(1):24-36. doi: 10.1006/geno.1997.4995.

The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998.

Nucleic Acids Res. 1998 Jan 1;26(1):38-42. doi: 10.1093/nar/26.1.38.

The PIR-International Protein Sequence Database.

Nucleic Acids Res. 1998 Jan 1;26(1):27-32. doi: 10.1093/nar/26.1.27.

The complete genome sequence of the gram-positive bacterium Bacillus subtilis.

Nature. 1997 Nov 20;390(6657):249-56. doi: 10.1038/36786.

The complete genome sequence of Escherichia coli K-12.

Science. 1997 Sep 5;277(5331):1453-62. doi: 10.1126/science.277.5331.1453.

Prediction of complete gene structures in human genomic DNA.

J Mol Biol. 1997 Apr 25;268(1):78-94. doi: 10.1006/jmbi.1997.0951.
