School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China.
Amino Acids. 2013 Feb;44(2):573-80. doi: 10.1007/s00726-012-1374-z. Epub 2012 Aug 1.
Accurate prediction of thermophilic proteins is useful for designing stable enzymes that remain functional at high temperatures. We used the increment of diversity (ID), a novel similarity distance based on amino acid composition, in a two-class K-nearest neighbor (KNN) classifier to discriminate thermophilic from mesophilic proteins; we call the resulting classifier KNN-ID. Instead of extracting features from protein sequences as done previously, our approach is based on a diversity measure of symbol sequences. The similarity distance between each pair of protein sequences is first calculated to quantify how similar one sequence is to the other. The class of the query protein is then determined using the K-nearest neighbor algorithm. Comparisons with several recently published methods showed that KNN-ID outperforms them. The improved predictive performance indicates that it is a simple and effective classifier for discriminating thermophilic and mesophilic proteins. Finally, the influence of protein length and protein sequence identity on prediction accuracy is discussed. The prediction model and dataset used in this article can be freely downloaded (http://wlxy.imu.edu.cn/college/biostation/fuwu/KNN-ID/index.htm).
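The two steps described above (a pairwise ID distance, then a K-nearest neighbor vote) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the ID formula, so the sketch assumes the commonly used definition: for a count vector with total N and entries n_i, the diversity is D = N ln N - sum(n_i ln n_i), and ID(X, Y) = D(X + Y) - D(X) - D(Y). The function names, the value of k, and the toy sequences and labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Count each standard amino acid in a sequence (other symbols are ignored)."""
    counts = Counter(seq.upper())
    return [counts.get(a, 0) for a in AMINO_ACIDS]


def diversity(counts):
    """Assumed diversity measure: D = N ln N - sum(n_i ln n_i)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """Assumed ID(X, Y) = D(X + Y) - D(X) - D(Y); smaller means more similar."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


def knn_id_predict(query, training, k=3):
    """Label a query sequence by majority vote among its k smallest-ID neighbors."""
    q = composition(query)
    scored = sorted(
        (increment_of_diversity(q, composition(seq)), label)
        for seq, label in training
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Made-up toy sequences, not real proteins and not the paper's dataset.
TOY_TRAINING = [
    ("KKKKEEEE", "thermophilic"),
    ("KEKEKEKE", "thermophilic"),
    ("QQQQSSSS", "mesophilic"),
    ("QSQSQSTT", "mesophilic"),
]

if __name__ == "__main__":
    print(knn_id_predict("EEKKEK", TOY_TRAINING, k=3))
```

Under this assumed definition, two sequences with proportional amino acid compositions have an ID of zero, and the value grows as the compositions diverge. The raw ID also scales with sequence length, so whether and how the published method normalizes it should be checked against the paper itself.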