Discriminating outer membrane proteins with Fuzzy K-nearest Neighbor algorithms based on the general form of Chou's PseAAC.

Author information

Hayat Maqsood, Khan Asifullah

Affiliation

Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, P.O. 45650, Nilore, Islamabad, Pakistan.

Publication information

Protein Pept Lett. 2012 Apr;19(4):411-21. doi: 10.2174/092986612799789387.

Abstract

Outer membrane proteins (OMPs) play important roles in cell biology and are targeted by multiple drugs. Identifying OMPs from genomic sequences and predicting their secondary and tertiary structures is challenging because their membrane-spanning regions are short and vary widely in their properties. An effective and accurate in silico method for discriminating OMPs from their primary sequences is therefore needed. In this paper, we analyze the performance of several machine learning methods for discriminating OMPs, namely Genetic Programming, K-nearest Neighbor, and Fuzzy K-nearest Neighbor (Fuzzy K-NN), in conjunction with discrete feature extraction methods: amino acid composition, amphiphilic pseudo amino acid composition, split amino acid composition (SAAC), and hybrid versions of these methods. The classifiers are evaluated on two datasets using 5-fold cross-validation. In these experiments, Fuzzy K-NN with SAAC-based features proved quite effective at discriminating OMPs. On dataset1, Fuzzy K-NN achieves the highest success rates: 99.00% accuracy for discriminating OMPs from non-OMPs, and 98.77% and 98.28% accuracy for discriminating OMPs from α-helix membrane proteins and globular proteins, respectively. On dataset2, Fuzzy K-NN achieves 99.55%, 99.90%, and 99.81% accuracy for discriminating OMPs from non-OMPs, α-helix membrane proteins, and globular proteins, respectively. The classification performance of the proposed method is satisfactory and better than the existing methods. Thus, it might be an effective tool for high-throughput identification of OMPs.
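To make the best-performing combination named in the abstract more concrete, the following is a minimal Python sketch of SAAC features fed to a fuzzy K-NN classifier. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the segment lengths used for SAAC, the number of neighbors, or the fuzzifier, so `n_term`, `c_term`, `k`, and `m` are hypothetical defaults, and the membership rule follows the commonly used inverse-distance formulation of fuzzy K-NN (Keller et al.) with crisp training labels.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(segment):
    """Return the 20-dimensional amino acid composition (fractions) of a segment."""
    counts = Counter(segment)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def saac_features(sequence, n_term=25, c_term=25):
    """Split amino acid composition: compositions of the N-terminal,
    middle, and C-terminal segments concatenated into 60 features.

    The segment lengths are assumptions for illustration only.
    """
    seq = sequence.upper()
    if n_term <= 0 or c_term <= 0 or len(seq) <= n_term + c_term:
        raise ValueError("sequence must be longer than n_term + c_term")
    head, middle, tail = seq[:n_term], seq[n_term:-c_term], seq[-c_term:]
    return composition(head) + composition(middle) + composition(tail)


def fuzzy_knn_memberships(train_X, train_y, x, k=3, m=2.0, eps=1e-9):
    """Return a dict mapping each class label to its fuzzy membership for x.

    Each of the k nearest neighbors votes with weight 1 / d^(2/(m-1)),
    and the weights are normalized so the memberships sum to 1.
    """
    neighbors = sorted(
        ((math.dist(row, x), label) for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    exponent = 2.0 / (m - 1.0)
    # eps guards against division by zero when x coincides with a training point
    weights = [(1.0 / (max(d, eps) ** exponent), label) for d, label in neighbors]
    total = sum(w for w, _ in weights)
    memberships = {label: 0.0 for label in set(train_y)}
    for w, label in weights:
        memberships[label] += w / total
    return memberships


def fuzzy_knn_predict(train_X, train_y, x, k=3, m=2.0):
    """Assign x to the class with the highest fuzzy membership."""
    memberships = fuzzy_knn_memberships(train_X, train_y, x, k=k, m=m)
    return max(memberships, key=memberships.get)
```

In use, each protein sequence would be converted with `saac_features`, and `fuzzy_knn_predict` would then be called on held-out sequences within each cross-validation fold. Unlike plain K-NN, the memberships give a graded degree of support for each class rather than a bare majority vote.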
