Metfessel B A, Saurugger P N, Connelly D P, Rich S S
Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis 55455.
Protein Sci. 1993 Jul;2(7):1171-82. doi: 10.1002/pro.5560020712.
We present an approach to predicting protein structural class that uses amino acid composition and hydrophobic pattern frequency information as input to two types of neural networks: (1) a three-layer back-propagation network and (2) a learning vector quantization network. The results of these methods are compared to those obtained from a modified Euclidean statistical clustering algorithm. The protein sequence data used to drive these algorithms consist of the normalized frequency of up to 20 amino acid types and six hydrophobic amino acid patterns. From these frequency values the structural class predictions for each protein (all-alpha, all-beta, or alpha-beta classes) are derived. Examples consisting of 64 previously classified proteins were randomly divided into multiple training (56 proteins) and test (8 proteins) sets. The best performing algorithm on the test sets was the learning vector quantization network using 17 inputs, obtaining a prediction accuracy of 80.2%. The Matthews correlation coefficients are statistically significant for all algorithms and all structural classes. The differences between algorithms are in general not statistically significant. These results show that information exists in protein primary sequences that is easily obtainable and useful for the prediction of protein structural class by neural networks as well as by standard statistical clustering algorithms.
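As an illustration of two quantities named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It computes the normalized frequency of the 20 amino acid types for one sequence and a per-class Matthews correlation coefficient, treating one structural class as positive and the others as negative. The function names `composition` and `matthews_corrcoef`, the one-versus-rest treatment of classes, and the handling of non-standard residues are assumptions made for this example. The six hydrophobic patterns are not specified in the abstract and are therefore left out.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids, one-letter codes (fixed order for the feature vector).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the normalized frequency of each of the 20 amino acid types.

    Assumption: residues outside the 20 standard codes are ignored.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def matthews_corrcoef(true_labels, predicted_labels, positive_class):
    """Matthews correlation coefficient for one class versus all others.

    Assumption: returns 0.0 when the denominator is zero (a common convention).
    """
    tp = tn = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive_class and guess == positive_class:
            tp += 1
        elif truth != positive_class and guess != positive_class:
            tn += 1
        elif guess == positive_class:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

A frequency vector from `composition` could serve as the amino acid part of the input to any of the three classifiers, and `matthews_corrcoef` could be evaluated once per structural class on each test set. The class labels used for that are whatever strings the caller chooses, for example `"all-alpha"`, `"all-beta"`, and `"alpha-beta"`.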