Hoseini Adele Sadat Haghighat, Mirzarezaee Mitra
Department of Computer Engineering, Science and Research branch, Islamic Azad University, Tehran, Iran.
School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.
Iran J Biotechnol. 2018 Aug 11;16(3):e1933. doi: 10.15171/ijb.1933. eCollection 2018 Aug.
Predicting protein localization is among the most important problems in bioinformatics; it is used to predict where proteins reside in cells and in organelles such as mitochondria. In this study, several machine learning algorithms are applied to predict intracellular protein locations. These algorithms use features extracted from protein sequences. In contrast, protein interactions have been less investigated.
Because protein interactions usually occur in the same or adjacent places, using this feature to find a protein's location should be efficient and effective. The aim of this study was not to increase the overall accuracy of the research conducted. Instead, it focused on protein interaction features and on how employing them leads to higher accuracy.
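The reasoning above, that interacting proteins tend to share a location, can be illustrated with a simple neighbor vote over an interaction network. This is a minimal sketch and not the method used in the study: the function name, the protein identifiers, the location labels, and the `min_score` confidence cutoff are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def predict_by_neighbors(protein, interactions, known_locations, min_score=0.0):
    """Majority vote over the known locations of a protein's interaction partners.

    interactions: iterable of (protein_a, protein_b, score) tuples.
    known_locations: dict mapping protein id -> location label.
    min_score: hypothetical cutoff; interactions scoring below it are ignored.
    Returns the most frequent partner location, or None if no partner is annotated.
    """
    votes = Counter()
    for a, b, score in interactions:
        if score < min_score:
            continue  # skip low-confidence interactions
        if a == protein:
            partner = b
        elif b == protein:
            partner = a
        else:
            continue  # this edge does not involve the query protein
        location = known_locations.get(partner)
        if location is not None:
            votes[location] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy data (hypothetical identifiers, scores, and labels).
edges = [("Q", "P1", 0.9), ("Q", "P2", 0.3), ("P3", "Q", 0.2)]
locations = {"P1": "matrix", "P2": "inner membrane", "P3": "inner membrane"}

all_edges_prediction = predict_by_neighbors("Q", edges, locations)
strict_prediction = predict_by_neighbors("Q", edges, locations, min_score=0.5)
```

In this toy example the prediction depends on the cutoff: with every edge counted, the two low-scoring partners outvote the single high-scoring one, while a stricter cutoff keeps only the high-confidence interaction.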
In this study, we examined the protein interaction network as one of the features for predicting protein localization, together with its effect on the prediction results. In this regard, we gathered some of the most common features, including Amino Acid Composition, Dipeptide Composition, Pseudo Amino Acid Composition (PseAAC), Position Specific Scoring Matrix (PSSM), Functional Domain, Gene Ontology information, and pairwise sequence alignment. The classification results obtained with these features were compared with those obtained using protein interactions. To this end, different machine learning algorithms were tested.
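Two of the sequence-derived features listed above, Amino Acid Composition and Dipeptide Composition, have simple standard definitions: the relative frequency of each of the 20 residues, and of each of the 400 ordered residue pairs at adjacent positions. The following is a minimal sketch of those definitions, not the authors' implementation; the function names and the example sequence are hypothetical, and non-standard residue codes are simply ignored here as an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """Return 20 values: the fraction of each standard residue in seq."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 values: the fraction of each ordered pair of adjacent residues."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Keep only pairs made of two standard residues.
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    n = len(valid)
    return [valid.count(d) / n if n else 0.0 for d in DIPEPTIDES]


# Hypothetical short sequence, used only to show the resulting vector size.
example = "MKVLAAGIV"
features = amino_acid_composition(example) + dipeptide_composition(example)
```

Concatenating the two vectors gives 420 numbers per protein, which can be passed to any classifier that accepts fixed-length numeric input.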
The best result for a single feature set was obtained using the SVM classifier with the PseAAC feature. Combining all features with PPI data gave accuracies of 82.49% and 83.35% with the Decision Tree and Random Forest classifiers, respectively. In another experiment, using only protein interaction data with different cutting points yielded an accuracy of 93.035% for protein location prediction.
Overall, protein interactions were shown to have a significant impact on predicting the location of mitochondrial proteins. Used on its own, this feature distinguishes the locations well. Using this feature raises the accuracy of the results by up to 5%.