Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand.
Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.
Sci Rep. 2022 May 11;12(1):7697. doi: 10.1038/s41598-022-11897-z.
Amyloid proteins can form insoluble fibril aggregates that have important pathogenic effects in many tissues. Such amyloidoses are prominently associated with common diseases such as type 2 diabetes, Alzheimer's disease, and Parkinson's disease. There are many types of amyloid proteins, and some proteins form amyloid aggregates when in a misfolded state. Identifying such amyloid proteins and their pathogenic properties is difficult, and developing effective bioinformatics tools offers a new approach to the problem. While several machine learning (ML)-based models for in silico identification of amyloid proteins have been proposed, their predictive performance is limited. In this study, we present AMYPred-FRL, a novel meta-predictor that uses a feature representation learning approach to achieve more accurate amyloid protein identification. In contrast to state-of-the-art methods, which were developed using a single feature-based approach, AMYPred-FRL combines six well-known ML algorithms (extremely randomized trees, extreme gradient boosting, k-nearest neighbor, logistic regression, random forest, and support vector machine) with ten different sequence-based feature descriptors to generate 60 probabilistic features (PFs). A logistic regression recursive feature elimination (LR-RFE) method was used to determine the optimal number m of the 60 PFs to retain in order to improve predictive performance. Finally, following the meta-predictor approach, the 20 selected PFs were fed into a logistic regression model to create the final hybrid model (AMYPred-FRL). Both cross-validation and independent tests showed that AMYPred-FRL achieved better predictive performance than its constituent baseline models. In an extensive independent test, AMYPred-FRL achieved an accuracy of 0.873 and an MCC of 0.710, outperforming the existing methods by 5.5% and 16.1%, respectively.
To expedite high-throughput prediction, a user-friendly web server of AMYPred-FRL is freely available at http://pmlabstack.pythonanywhere.com/AMYPred-FRL. It is anticipated that AMYPred-FRL will be a useful tool in helping researchers to identify new amyloid proteins.
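The feature representation learning scheme summarized in the abstract (base learners crossed with sequence descriptors to produce probabilistic features, LR-RFE selection, then a logistic regression meta-model) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn, uses only two simple descriptors (amino acid and dipeptide composition) in place of the ten used in the study, substitutes scikit-learn's GradientBoostingClassifier for extreme gradient boosting, and runs on a made-up toy dataset. It therefore produces 12 PFs and keeps 4, rather than 60 and 20. All function names, hyperparameters, and the data-generating rule are hypothetical.

```python
import itertools
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=2)]

def aac(seq):
    """Amino acid composition: fraction of each of the 20 residues."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(seq):
    """Dipeptide composition: fraction of each of the 400 adjacent pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(d) / len(pairs) for d in DIPEPTIDES]

# Two illustrative descriptors (the study uses ten).
DESCRIPTORS = {"AAC": aac, "DPC": dpc}

# Six base learners; GradientBoostingClassifier is a stand-in for XGBoost.
LEARNERS = {
    "ET": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "GB": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(probability=True, random_state=0),
}

def probabilistic_features(seqs, y, cv=5):
    """Build one out-of-fold positive-class probability column
    per (descriptor, learner) pair."""
    columns, names = [], []
    for d_name, encode in DESCRIPTORS.items():
        X = np.array([encode(s) for s in seqs])
        for l_name, model in LEARNERS.items():
            proba = cross_val_predict(clone(model), X, y, cv=cv,
                                      method="predict_proba")
            columns.append(proba[:, 1])
            names.append(f"{l_name}_{d_name}")
    return np.column_stack(columns), names

def toy_dataset(n=120, length=30, seed=0):
    """Synthetic sequences; positives are enriched in hydrophobic
    residues (a made-up rule, not real amyloid data)."""
    rng = np.random.default_rng(seed)
    hydrophobic = list("VILFYW")
    seqs, labels = [], []
    for i in range(n):
        label = i % 2
        pool = list(AMINO_ACIDS) + hydrophobic * (6 if label else 0)
        seqs.append("".join(rng.choice(pool, size=length)))
        labels.append(label)
    return seqs, np.array(labels)

seqs, y = toy_dataset()
pf, pf_names = probabilistic_features(seqs, y)  # 6 learners x 2 descriptors

# LR-RFE: recursively drop the PF with the smallest LR coefficient magnitude.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(pf, y)
selected = [n for n, keep in zip(pf_names, selector.support_) if keep]

# Meta-predictor: logistic regression trained on the selected PFs.
meta_pred = cross_val_predict(LogisticRegression(max_iter=1000),
                              pf[:, selector.support_], y, cv=5)
acc = accuracy_score(y, meta_pred)
mcc = matthews_corrcoef(y, meta_pred)
print(selected, round(acc, 3), round(mcc, 3))
```

In this sketch the number of retained PFs is fixed at 4 for brevity, and the selection step sees all labels before the meta-model is cross-validated, so the printed scores are optimistic. A faithful evaluation would choose m by cross-validation and keep the independent test set out of every fitting step.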