Machine learning methods can replace 3D profile method in classification of amyloidogenic hexapeptides.

Affiliation

Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, 50-370 Wroclaw, Poland.

Publication information

BMC Bioinformatics. 2013 Jan 17;14:21. doi: 10.1186/1471-2105-14-21.

DOI:10.1186/1471-2105-14-21

PMID:23327628

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3566972/

Abstract

BACKGROUND

Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, such as Alzheimer's disease. The number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of amino acids, which transform the structure when exposed. A few hundred such peptides have been found experimentally. Experimental testing of all possible amino acid combinations is currently not feasible. Instead, amyloidogenic segments can be predicted by computational methods. The 3D profile method is a physicochemical approach that has generated the largest dataset, ZipperDB. However, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods.
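To make the scale of the problem concrete, the following is a minimal sketch (not the authors' code) that counts the possible hexapeptides over the 20 standard amino acids and enumerates the six-residue windows of a sequence. The function names and the example string are illustrative assumptions.

```python
from typing import Iterator, Tuple

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 6  # hexapeptide length


def hexapeptide_space_size() -> int:
    """Number of distinct hexapeptides over the 20 standard residues."""
    return len(AMINO_ACIDS) ** WINDOW


def sliding_hexapeptides(sequence: str) -> Iterator[Tuple[int, str]]:
    """Yield (start_index, segment) for every 6-residue window.

    Windows containing non-standard symbols are skipped (an assumption
    made for this sketch, not a rule taken from the paper).
    """
    seq = sequence.upper()
    for start in range(len(seq) - WINDOW + 1):
        segment = seq[start:start + WINDOW]
        if all(res in AMINO_ACIDS for res in segment):
            yield start, segment


if __name__ == "__main__":
    print(hexapeptide_space_size())
    # Arbitrary 8-residue example string, used only to show the windowing.
    for start, segment in sliding_hexapeptides("KLVFFAED"):
        print(start, segment)
```

With 20 choices at each of six positions there are 20^6 = 64,000,000 candidate hexapeptides, which is why exhaustive experimental testing is impractical and a cheap classifier is attractive.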

RESULTS

We generated a new dataset of hexapeptides using a more economical 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, with 204 classified as amyloidogenic. The dataset of 6-residue sequences, with their binary classification based on the energy of the segment, was used to train machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were Alternating Decision Tree and Multilayer Perceptron. Both obtained an area under the ROC curve of 0.96, accuracy of 91%, true positive rate of ca. 78%, and true negative rate of 95%. A few other machine learning methods also performed well. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning).
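The threshold-based metrics quoted above (accuracy, true positive rate, true negative rate) all derive from a confusion matrix. The sketch below shows how they could be computed from binary labels. The class and function names are assumptions, and the toy labels are invented for illustration rather than taken from the paper. The area under the ROC curve is not covered, since it requires classifier scores rather than hard labels.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Confusion:
    tp: int  # amyloidogenic, predicted amyloidogenic
    fn: int  # amyloidogenic, predicted non-amyloidogenic
    tn: int  # non-amyloidogenic, predicted non-amyloidogenic
    fp: int  # non-amyloidogenic, predicted amyloidogenic

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fn + self.tn + self.fp)

    @property
    def true_positive_rate(self) -> float:
        # Sensitivity; undefined (division by zero) if there are no positives.
        return self.tp / (self.tp + self.fn)

    @property
    def true_negative_rate(self) -> float:
        # Specificity; undefined (division by zero) if there are no negatives.
        return self.tn / (self.tn + self.fp)


def confusion_from_labels(y_true: Sequence[int], y_pred: Sequence[int]) -> Confusion:
    """Count outcomes for binary labels, where 1 = amyloidogenic and 0 = not."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
    )


if __name__ == "__main__":
    # Toy labels, invented for this example only.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
    c = confusion_from_labels(y_true, y_pred)
    print(c.accuracy, c.true_positive_rate, c.true_negative_rate)
```

Because the amyloidogenic class is the minority (204 of 1779 segments in the new part of the dataset), accuracy alone can look good for a classifier that misses many positives, which is presumably why the true positive and true negative rates are reported separately.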

CONCLUSIONS

We showed that the simplified profile generation method does not introduce an error with regard to the original method, while increasing computational efficiency. Our new dataset proved representative enough to use simple statistical methods for testing amyloidogenicity based only on six-letter sequences. Statistical machine learning methods such as Alternating Decision Tree and Multilayer Perceptron can replace the energy-based classifier, with the advantages of greatly reduced computational time and simplicity of analysis. Additionally, a decision tree provides a set of very easily interpretable rules.
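One common way to feed a six-letter sequence to a classifier is position-wise one-hot encoding, sketched below. The abstract does not say which encoding the authors used, so this representation, and the names `one_hot_hexapeptide` and `feature_name`, are assumptions made for illustration. Named per-position features are also what would make tree rules of the form "residue X at position N" easy to read.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
LENGTH = 6  # hexapeptide length


def one_hot_hexapeptide(peptide: str) -> List[int]:
    """Encode a hexapeptide as a 120-dimensional 0/1 vector (6 positions x 20 residues)."""
    pep = peptide.upper()
    if len(pep) != LENGTH:
        raise ValueError("expected exactly six residues")
    vector = [0] * (LENGTH * len(AMINO_ACIDS))
    for pos, aa in enumerate(pep):
        if aa not in INDEX:
            raise ValueError(f"non-standard residue: {aa!r}")
        vector[pos * len(AMINO_ACIDS) + INDEX[aa]] = 1
    return vector


def feature_name(i: int) -> str:
    """Human-readable name for feature i, e.g. 'pos3=I'."""
    return f"pos{i // len(AMINO_ACIDS) + 1}={AMINO_ACIDS[i % len(AMINO_ACIDS)]}"


if __name__ == "__main__":
    vec = one_hot_hexapeptide("VQIVYK")  # example input only
    print([feature_name(i) for i, bit in enumerate(vec) if bit])
```

Any classifier that accepts fixed-length numeric vectors could then be trained on these features together with the energy-derived binary labels.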
