
Intrinsic and extrinsic approaches for detecting genes in a bacterial genome.

Authors

Borodovsky M, Rudd K E, Koonin E V

Affiliation

School of Biology, Georgia Institute of Technology, Atlanta 30332-0230.

Publication

Nucleic Acids Res. 1994 Nov 11;22(22):4756-67. doi: 10.1093/nar/22.22.4756.

Abstract

The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins.
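The percentages quoted in the abstract can be checked directly from the reported counts. The short Python sketch below recomputes them. The variable names are illustrative, and the derived figures (ORFs supported by only one method, and by either method) are not stated in the abstract; they assume that the 182 jointly supported ORFs are counted within both the 354 GeneMark predictions and the 208 BLAST hits.

```python
# Counts reported in the abstract.
genemark_orfs = 354   # putative expressed ORFs predicted by GeneMark
blast_orfs = 208      # ORFs with significant BLASTX/TBLASTN similarity
both = 182            # ORFs supported by both GeneMark and BLAST
conserved = 73        # putative genes in ancient conserved protein families


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Percentages stated in the abstract.
share_of_genemark = percent(both, genemark_orfs)    # 51.4
share_of_blast = percent(both, blast_orfs)          # 87.5
conserved_share = percent(conserved, genemark_orfs) # 20.6

# Derived figures (assumption: the 182 ORFs lie in both sets,
# so simple inclusion-exclusion applies).
genemark_only = genemark_orfs - both         # GeneMark prediction, no BLAST hit
blast_only = blast_orfs - both               # BLAST hit, no GeneMark prediction
either = genemark_orfs + blast_orfs - both   # supported by at least one method

print(share_of_genemark, share_of_blast, conserved_share)
print(genemark_only, blast_only, either)
```

Under that assumption, 172 ORFs are predicted by GeneMark alone. This is the group the abstract argues is still likely to consist of real genes, because the 20.6% conserved share is close to the overall proportion of highly conserved sequences among eubacterial proteins.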

