Gene recognition from questionable ORFs in bacterial and archaeal genomes.

Author information

Chen Ling-Ling, Zhang Chun-Ting

Affiliation

Department of Physics, Tianjin University, Tianjin, 300072, China.

Publication information

J Biomol Struct Dyn. 2003 Aug;21(1):99-109. doi: 10.1080/07391102.2003.10506908.

DOI:10.1080/07391102.2003.10506908

PMID:12854962

Abstract

In annotation files, the ORFs of microbial genomes are usually classified into two groups: the first corresponds to known genes, whereas the second comprises ORFs labeled 'putative', 'probable', 'conserved hypothetical', 'hypothetical', 'unknown', 'predicted' and the like. Since annotation is not 100% accurate, it is essential to determine which ORFs in the second group are coding and which are not. Starting from the known genes in the first group, this paper describes an improved Z curve method for recognizing genes in the second. Ten-fold cross-validation tests show that the average accuracy of the algorithm exceeds 99% in recognizing the known genes of 57 bacterial and archaeal genomes. The method is then applied to the ORFs of the second group. The likely non-coding ORFs in each of the 57 genomes are identified and listed at http://tubic.tju.edu.cn/ZCURVE_C_html/noncoding.html. The working mechanism of the algorithm is discussed in detail. A computer program, ZCURVE_C, was written to calculate a coding score, called the Z-curve score, for ORFs in these 57 genomes. An ORF is classified as coding if its Z-curve score is greater than 0 and as non-coding if its score is less than 0. A website has been set up to calculate the Z-curve score: a user may submit the DNA sequence of an ORF to the server at http://tubic.tju.edu.cn/ZCURVE_C/Default.cgi, and the score is calculated and returned immediately.
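The sign criterion described above can be illustrated with a minimal Python sketch. The abstract does not say which variables the improved method uses, so this example assumes the classic nine phase-specific Z-curve components (x, y, z at each of the three codon positions) and a linear score. The function names, the `weights` vector and the `threshold` are hypothetical placeholders, not values or interfaces from ZCURVE_C; in practice such parameters would be fitted on the known genes of each genome.

```python
from collections import Counter


def z_curve_features(orf: str) -> list[float]:
    """Return nine phase-specific Z-curve components for an ORF.

    Assumption: the classic Z-curve transform applied separately to
    codon positions 1, 2 and 3. The improved method in the paper may
    use a different or larger set of variables.
    """
    seq = orf.upper()
    features: list[float] = []
    for phase in range(3):
        counts = Counter(seq[phase::3])
        n = sum(counts[b] for b in "ACGT")  # ignore ambiguous bases
        if n == 0:
            raise ValueError(f"no A/C/G/T bases at codon position {phase + 1}")
        a, c, g, t = (counts[b] / n for b in "ACGT")
        features.extend([
            (a + g) - (c + t),  # x: purine vs. pyrimidine
            (a + c) - (g + t),  # y: amino vs. keto
            (a + t) - (g + c),  # z: weak vs. strong hydrogen bonding
        ])
    return features


def z_curve_score(features: list[float], weights: list[float], threshold: float) -> float:
    """Linear score shifted so that 0 is the decision boundary.

    `weights` and `threshold` are hypothetical; they stand in for
    parameters trained on known genes.
    """
    return sum(w * f for w, f in zip(weights, features)) - threshold


def is_coding(score: float) -> bool:
    """Apply the stated criterion: score > 0 means coding.

    The abstract leaves score == 0 undefined; this sketch treats it
    as non-coding.
    """
    return score > 0


if __name__ == "__main__":
    feats = z_curve_features("ATGATGATG")
    placeholder_weights = [1.0] * 9  # illustration only, not trained values
    score = z_curve_score(feats, placeholder_weights, threshold=0.0)
    print(feats, score, is_coding(score))
```

With real data, the weights would be learned from the known-gene group and a set of negative samples, and the accuracy estimated by ten-fold cross-validation as the abstract describes.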
