Locating protein coding regions in human DNA using a decision tree algorithm.

Author information

Salzberg S

Affiliation

Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA.

Publication information

J Comput Biol. 1995 Fall;2(3):473-85. doi: 10.1089/cmb.1995.2.473.

DOI:10.1089/cmb.1995.2.473

PMID:8521276

Abstract

Genes in eukaryotic DNA cover hundreds or thousands of base pairs, while the regions of those genes that code for proteins may occupy only a small percentage of the sequence. Identifying the coding regions is of vital importance in understanding these genes. Many recent research efforts have studied computational methods for distinguishing between coding and noncoding regions, and several promising results have been reported. We describe here a new approach, using a machine learning system that builds decision trees from the data. This approach combines several coding measures to produce classifiers with consistently higher accuracies than previous methods, on DNA sequences ranging from 54 to 162 base pairs in length. The algorithm is very efficient, and it can easily be adapted to different sequence lengths. Our conclusion is that decision trees are a highly effective tool for identifying protein coding regions.
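
The abstract's core idea, combining several coding measures computed on a fixed-length window and letting a decision tree learn how to weigh them, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three measures (in-frame stop codon count, GC fraction, third-position GC fraction), the greedy Gini-based tree builder, and the toy 54-base sequences are hypothetical stand-ins, not the measures, tree-induction system, or data used in the paper.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def coding_measures(window):
    """Feature vector for one DNA window (illustrative measures, not the paper's)."""
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    stops = sum(c in STOP_CODONS for c in codons)        # in-frame stop codons
    gc = sum(b in "GC" for b in window) / len(window)    # overall GC fraction
    third = window[2::3]                                 # third codon positions
    gc3 = sum(b in "GC" for b in third) / len(third)
    return [stops, gc, gc3]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """Greedy axis-parallel tree: a leaf is a label, a node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or gini(labels) == 0.0:
        return majority
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return majority
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in li], [labels[i] for i in li], depth + 1, max_depth),
            build_tree([rows[i] for i in ri], [labels[i] for i in ri], depth + 1, max_depth))

def classify(tree, features):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if features[f] <= t else right
    return tree

# Made-up 54-base toy windows (not real data), labeled by hand for the demo.
train = [
    ("ATGGCC" * 9, "coding"),
    ("GCCAAGCTG" * 6, "coding"),
    ("TAAATT" * 9, "noncoding"),
    ("ATTTGA" * 9, "noncoding"),
]
rows = [coding_measures(seq) for seq, _ in train]
labels = [label for _, label in train]
tree = build_tree(rows, labels)

print(classify(tree, coding_measures("GAGCTGAAG" * 6)))
print(classify(tree, coding_measures("TAGCAT" * 9)))
```

Because `coding_measures` takes any window whose length is a multiple of three, the same sketch applies unchanged to 54-, 108-, or 162-base windows; only the training set needs to be rebuilt for each length.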
