Protein fold recognition based on sparse representation based classification.

Author information

Yan Ke, Xu Yong, Fang Xiaozhao, Zheng Chunhou, Liu Bin

Affiliations

School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, 518055, China.

College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui, 230039, China.

Publication information

Artif Intell Med. 2017 Jun;79:1-8. doi: 10.1016/j.artmed.2017.03.006. Epub 2017 Mar 27.

DOI:10.1016/j.artmed.2017.03.006

PMID:28359635

Abstract

Knowledge of a protein's fold type is critical for determining its structure and function. Because of this importance, several computational methods for fold recognition have been proposed. Most of them are based on well-known machine learning techniques, such as Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs). Although these methods have stimulated the development of this important area, new techniques are still needed to further improve predictive performance for fold recognition. Sparse Representation based Classification (SRC) has been widely used in image processing, where it shows better performance than other related machine learning methods. In this study, we apply SRC to the protein fold recognition problem. Experimental results on a widely used benchmark dataset show that the proposed method is able to improve the performance of some basic classifiers and of three state-of-the-art feature selection methods, namely autocross-covariance (ACC) fold, D-D, and Bi-gram. Finally, we propose a novel computational predictor for fold recognition, called MF-SRC, which combines these three features within the SRC framework to achieve a further performance improvement. Compared with other computational methods in this field on the DD, EDD, and TG datasets, the proposed method achieves stable performance by reducing the influence of noise in the dataset. It is anticipated that the proposed predictor may become a useful high-throughput tool for large-scale fold recognition or, at least, play a complementary role to existing predictors in this regard.
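
To make the general idea of SRC concrete, the following is a minimal sketch of a generic sparse-representation classifier, not the authors' MF-SRC implementation. It assumes each protein has already been turned into a fixed-length feature vector; the toy three-dimensional vectors, the labels `fold_a` and `fold_b`, the function names, and the choice of ISTA as the l1 solver are all illustrative assumptions that do not come from the paper. The query is coded as a sparse combination of the training vectors and assigned to the class whose training vectors reconstruct it with the smallest residual.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def soft_threshold(value, threshold):
    """Shrink a scalar toward zero; this is the proximal step for the l1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_code(atoms, y, lam=0.01, n_iter=500):
    """ISTA for min_x 0.5 * ||y - sum_j x[j] * atoms[j]||^2 + lam * ||x||_1."""
    # The squared Frobenius norm bounds the Lipschitz constant, so this step is safe.
    step = 1.0 / sum(a * a for atom in atoms for a in atom)
    x = [0.0] * len(atoms)
    for _ in range(n_iter):
        recon = [sum(x[j] * atoms[j][i] for j in range(len(atoms)))
                 for i in range(len(y))]
        residual = [r - t for r, t in zip(recon, y)]
        grad = [sum(a * r for a, r in zip(atom, residual)) for atom in atoms]
        x = [soft_threshold(xj - step * g, step * lam) for xj, g in zip(x, grad)]
    return x


def src_predict(train_vectors, train_labels, query, lam=0.01, n_iter=500):
    """Return (predicted label, per-class reconstruction residuals)."""
    atoms = [normalize(v) for v in train_vectors]
    y = normalize(query)
    x = sparse_code(atoms, y, lam, n_iter)
    residuals = {}
    for label in set(train_labels):
        # Reconstruct the query using only the coefficients of this class.
        recon = [0.0] * len(y)
        for coef, atom, atom_label in zip(x, atoms, train_labels):
            if atom_label == label:
                recon = [r + coef * a for r, a in zip(recon, atom)]
        residuals[label] = math.sqrt(sum((t - r) ** 2 for t, r in zip(y, recon)))
    return min(residuals, key=residuals.get), residuals


# Toy data (hypothetical): two training proteins per fold, three features each.
train_vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1],
                 [0.0, 0.1, 1.0], [0.1, 0.2, 0.9]]
train_labels = ["fold_a", "fold_a", "fold_b", "fold_b"]
query = [0.95, 0.15, 0.05]

label, residuals = src_predict(train_vectors, train_labels, query)
print(label, residuals)
```

In this sketch the penalty weight `lam` and the iteration count are arbitrary defaults. The abstract does not state how MF-SRC combines the three feature types, so any extension of this sketch to multiple feature sets would be a further assumption.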
