Machine learning study of classifiers trained with biophysiochemical properties of amino acids to predict fibril forming Peptide motifs.

Authors

Kumaran Nair Smitha Sunil, Subba Reddy N V, Hareesha K S

Affiliation

Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal University, Karnataka 576104, India.

Publication information

Protein Pept Lett. 2012 Sep;19(9):917-23. doi: 10.2174/092986612802084429.

DOI:10.2174/092986612802084429

PMID:22486618

Abstract

Predicting the short protein fragments capable of forming amyloid-like fibrils helps explain the causes of amyloid diseases and aids the discovery of sequence-targeted anti-aggregation drugs. Because the molecular techniques used to identify such fragments are limited, computational tools that provide affordable in silico predictions are highly desirable. In this article, we study, from a machine learning perspective, the performance of several classifiers that use heterogeneous features based on the biochemical and biophysical properties of amino acids to discriminate between amyloidogenic and non-amyloidogenic regions in peptides. Four conventional classifiers, namely Support Vector Machine (SVM), neural network, decision tree, and random forest, were trained and tested to find the one that best fits the problem domain. Prior to classification, unimportant and uninformative features were removed using novel implementations of two biologically inspired feature optimization techniques, one based on evolutionary algorithms and one on methodologies that mimic social life, together with a multivariate method based on projection. Among the dimensionality reduction algorithms considered, the prediction results show that those based on evolutionary computation are the most effective, and SVM is the classifier best suited to the problem domain. The best classifier is also compared with an online predictor to demonstrate the balance it maintains between true positive and false positive rates. This exploratory study suggests that these methods are promising for amyloidogenicity prediction and may be extended to large-scale proteomic studies.
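The abstract evaluates the classifiers by the balance between true positive rate and false positive rate but does not give the formulas. The sketch below uses the standard definitions, TPR = TP / (TP + FN) and FPR = FP / (FP + TN), and assumes labels of 1 for amyloidogenic and 0 for non-amyloidogenic peptides. The function name `tpr_fpr` and the example labels are hypothetical and are not taken from the study.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels.

    Assumed encoding: 1 = amyloidogenic, 0 = non-amyloidogenic.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly rejected
    # Guard against empty classes to avoid division by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical labels for eight peptides (illustration only, not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

A predictor that labels every peptide amyloidogenic reaches a TPR of 1.0 only by driving the FPR to 1.0 as well, which is why the two rates are reported together.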
