Craven M W, Mural R J, Hauser L J, Uberbacher E C
Computer Sciences Department, University of Wisconsin-Madison 53706, USA.
Proc Int Conf Intell Syst Mol Biol. 1995;3:98-106.
An important open problem in molecular biology is how to use computational methods to understand the structure and function of proteins given only their primary sequences. We describe and evaluate an original machine-learning approach to classifying protein sequences according to their structural folding class. Our work is novel in several respects: we use a set of protein classes that have not previously been used for classifying primary sequences, and we use a unique set of attributes to represent protein sequences to the learners. We evaluate our approach by measuring its ability to correctly classify proteins that were not in its training set. We compare our input representation to a commonly used one, amino acid composition, and show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.
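For readers unfamiliar with the baseline representation named in the abstract, amino acid composition describes a protein by the relative frequency of each of the 20 standard residues in its primary sequence. The sketch below is a minimal illustration of that baseline only; it does not reproduce the authors' own attribute set, which the abstract does not specify. The function name, the residue ordering, and the example sequence are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes.
# The alphabetical ordering is an arbitrary choice made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list of residue frequencies for a protein sequence.

    Characters outside the 20 standard one-letter codes (for example X or
    gap symbols) are ignored, so the frequencies sum to 1 over the
    standard residues only.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acid residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical example sequence, not taken from the paper.
composition = amino_acid_composition("MKVLAAGIVGL")
```

A vector of this kind discards residue order entirely, which is one reason a richer attribute set might be expected to help when test proteins share little homology with the training sequences.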