

ECAmyloid: An amyloid predictor based on ensemble learning and comprehensive sequence-derived features.

Affiliations

School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, 264209, China.


Publication information

Comput Biol Chem. 2023 Jun;104:107853. doi: 10.1016/j.compbiolchem.2023.107853. Epub 2023 Mar 23.

Abstract

Amyloid fibrils formed by the misaggregation of amyloid proteins can lead to neuronal degeneration in Alzheimer's disease. Predicting amyloid proteins not only contributes to understanding the physicochemical properties and formation mechanism of amyloid proteins, but also has significant implications for the treatment of amyloid diseases and the development of new uses for amyloid materials. In this study, an ensemble learning model with sequence-derived features, ECAmyloid, is proposed to identify amyloids. The sequence-derived features, namely Pseudo Position-Specific Scoring Matrix (Pse-PSSM), Split Amino Acid Composition (SAAC), Solvent Accessibility (SA), and Secondary Structure Information (SSI), are employed to incorporate sequence composition, evolutionary, and structural information. The individual learners of the ensemble model are selected by an incremental classifier selection strategy, and the final prediction is determined by voting over the predictions of the individual learners. Because the benchmark dataset is imbalanced, the Synthetic Minority Over-sampling Technique (SMOTE) is adopted to generate positive samples. To eliminate irrelevant and redundant features, correlation-based feature subset (CFS) selection combined with a heuristic search strategy is performed to obtain the optimal feature subset. Experimental results indicate that the ensemble classifier achieves an accuracy of 98.29%, a sensitivity of 0.992, and a specificity of 0.974 on the training dataset under 10-fold cross-validation, far higher than the results obtained by its individual learners. Compared with the original feature set, the accuracy, sensitivity, specificity, MCC, F1-score, and G-Mean of the ensemble method trained on the optimal feature subset improve by 1.05%, 0.012, 0.01, 0.021, 0.011, and 0.011, respectively. Moreover, comparison with existing methods on the same two independent test datasets demonstrates that the proposed method is an effective and promising predictor for large-scale determination of amyloid proteins. The data and code used to develop ECAmyloid have been shared on GitHub and can be freely downloaded at https://github.com/KOALA-L/ECAmyloid.git.
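As an illustration of the voting step and the evaluation measures named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes hard (majority) voting over binary labels, which the abstract does not specify. The function names `majority_vote` and `binary_metrics` and the toy labels are hypothetical. The formulas are the standard confusion-matrix definitions of accuracy, sensitivity, specificity, MCC, F1-score, and G-Mean.

```python
from collections import Counter
from math import sqrt


def majority_vote(predictions):
    """Combine per-learner label lists into one label per sample (hard voting)."""
    combined = []
    for votes in zip(*predictions):
        # Take the most frequent label among the learners for this sample.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def binary_metrics(y_true, y_pred, positive=1):
    """Standard confusion-matrix measures for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    pr_sum = precision + sensitivity
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "f1": 2 * precision * sensitivity / pr_sum if pr_sum else 0.0,
        "g_mean": sqrt(sensitivity * specificity),
    }


# Toy example (hypothetical labels): each learner makes one mistake,
# but the mistakes fall on different samples.
y_true = [1, 1, 1, 0, 0, 0]
learner_a = [1, 1, 0, 0, 0, 1]
learner_b = [1, 0, 1, 0, 0, 0]
learner_c = [1, 1, 1, 0, 1, 0]

ensemble = majority_vote([learner_a, learner_b, learner_c])
print(ensemble)
print(binary_metrics(y_true, ensemble))
```

In this toy case the errors of the three learners do not overlap, so the vote recovers every label. This is the intuition behind an ensemble outperforming its individual learners, although real learners rarely err this independently.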
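The idea behind SMOTE, which the abstract uses to generate positive samples, can be sketched as follows. This is a simplified, standard-library illustration and not the code used for ECAmyloid: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. The function name `smote_sketch`, its parameters, and the toy feature vectors are assumptions made for the example. In practice an established implementation would normally be preferred.

```python
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])))
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical 2-D feature vectors).
positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote_sketch(positives, n_synthetic=3, k=2):
    print(point)
```

Because every synthetic point lies between two real minority samples, the oversampled class stays within the region already occupied by the positives instead of merely duplicating existing rows.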

