Kavianpour Hamidreza, Vasighi Mahdi
Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), 45137-66731, Zanjan, Iran.
Amino Acids. 2017 Feb;49(2):261-271. doi: 10.1007/s00726-016-2354-5. Epub 2016 Oct 24.
Knowledge of the cellular attributes of proteins plays an important role in pharmacy, medical science and molecular biology. These attributes are closely correlated with the function and three-dimensional structure of proteins. Various methods use knowledge of protein structural class to better understand protein function and folding patterns. Computational methods and intelligent systems can play an important role in the structural classification of proteins. Most protein sequences are stored in databanks as character strings, so a numerical representation is essential for applying machine learning methods. In this work, a binary representation of protein sequences is introduced, based on reduced amino acid alphabets defined according to the surrounding hydrophobicity index. Many important features hidden in these long binary sequences can be clearly displayed through their cellular automata images. The features extracted from these images are used to build a classification model with a support vector machine. Compared with previous studies on several benchmark datasets, the promising classification rates obtained by tenfold cross-validation suggest that the current approach can help reveal inherent features deeply hidden in protein sequences and improve the quality of protein structural class prediction.
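The pipeline described in the abstract (binary encoding of a sequence, then a cellular automaton image) can be illustrated with a minimal sketch. The abstract does not give the amino acid grouping or the automaton rule, so everything specific here is an assumption made for illustration: the `HYDROPHOBIC` set is a placeholder two-group alphabet and not the surrounding-hydrophobicity grouping used in the paper, rule 90 is an arbitrary elementary rule, and the function names `encode_binary`, `ca_step` and `ca_image` are hypothetical. Feature extraction and the support vector machine step are omitted.

```python
# Placeholder two-letter reduced alphabet: residues in this set map to 1,
# all others to 0. ASSUMPTION: this grouping is illustrative only and is
# not the surrounding-hydrophobicity grouping used in the paper.
HYDROPHOBIC = set("AVLIMFWC")


def encode_binary(sequence):
    """Map a protein sequence (one-letter codes) to a list of 0/1 values."""
    return [1 if residue in HYDROPHOBIC else 0 for residue in sequence.upper()]


def ca_step(cells, rule):
    """Apply one step of an elementary cellular automaton (Wolfram rule
    number 0-255) with periodic boundary conditions."""
    n = len(cells)
    next_cells = []
    for i in range(n):
        left = cells[(i - 1) % n]
        center = cells[i]
        right = cells[(i + 1) % n]
        # The three neighbours form a 3-bit index into the rule number.
        neighborhood = (left << 2) | (center << 1) | right
        next_cells.append((rule >> neighborhood) & 1)
    return next_cells


def ca_image(cells, rule=90, steps=8):
    """Stack the initial row and `steps` evolved rows into a 2D 0/1 image.

    ASSUMPTION: rule 90 and the number of steps are arbitrary defaults.
    """
    rows = [list(cells)]
    for _ in range(steps):
        rows.append(ca_step(rows[-1], rule))
    return rows


if __name__ == "__main__":
    # Hypothetical short sequence, used only to show the image layout.
    image = ca_image(encode_binary("MKVLAAGIVGLSTE"), rule=90, steps=6)
    for row in image:
        print("".join("#" if cell else "." for cell in row))
```

Each row of the resulting image is one time step of the automaton, so a sequence of length n evolved for t steps gives a (t + 1) × n binary image from which texture-like features could then be computed.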