Amyloidogenic motifs revealed by n-gram analysis.

Author affiliations

Department of Genomics, University of Wrocław, Wrocław, Poland.

Faculty of Pure and Applied Mathematics, Wrocław University of Science and Technology, Wrocław, Poland.

Publication information

Sci Rep. 2017 Oct 11;7(1):12961. doi: 10.1038/s41598-017-13210-9.

Abstract

Amyloids are proteins associated with several clinical disorders, including Alzheimer's disease and Creutzfeldt-Jakob disease. Despite their diversity, all amyloid proteins can undergo aggregation initiated by short segments called hot spots. To find the patterns that define hot spots, we trained predictors of amyloidogenicity using n-grams and random forest classifiers. Because amyloidogenicity may depend not on the exact sequence of amino acids but on their more general properties, we tested 524,284 reduced amino acid alphabets of different lengths (three to six letters) to find the alphabet that performs best in cross-validation. The predictor based on this alphabet, called AmyloGram, was benchmarked on an external data set against the most popular tools for detecting amyloid peptides and obtained the highest values of the performance measures (AUC: 0.90, MCC: 0.63). Our results reveal sequential patterns in amyloids that are strongly correlated with hydrophobicity, a tendency to form β-sheets, and lower flexibility of amino acid residues. Among the most informative n-grams of AmyloGram, we identified 15 that were previously confirmed experimentally. AmyloGram is available as a web server (http://smorfland.uni.wroc.pl/shiny/AmyloGram/) and as the R package AmyloGram. R scripts and data used to produce the results of this manuscript are available at http://github.com/michbur/AmyloGramAnalysis.
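To make the idea of a reduced alphabet combined with n-grams more concrete, the following is a minimal sketch, not the AmyloGram implementation. The four-group alphabet, the function names, and the example peptide are all hypothetical choices made for illustration; the alphabet selected in the paper is not reproduced here, and only contiguous n-grams are counted, which may be a simplification of the features the authors used.

```python
from collections import Counter

# Hypothetical 4-letter reduced alphabet, for illustration only.
# This is NOT the alphabet selected by AmyloGram.
GROUPS = {
    "a": "AVLIMFWC",  # mostly hydrophobic residues
    "b": "STNQGPYH",  # polar or small residues
    "c": "DE",        # negatively charged residues
    "d": "KR",        # positively charged residues
}

# Invert the grouping: one-letter amino acid code -> reduced-alphabet letter.
AA_TO_GROUP = {aa: group for group, aas in GROUPS.items() for aa in aas}


def reduce_sequence(seq):
    """Re-encode a peptide in the reduced alphabet."""
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper())


def count_ngrams(seq, n):
    """Count contiguous n-grams (substrings of length n) in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


peptide = "KLVFFA"                  # example hexapeptide (assumed input)
reduced = reduce_sequence(peptide)  # "daaaaa"
bigrams = count_ngrams(reduced, 2)  # counts: "da" once, "aa" four times

print(reduced)
print(dict(bigrams))
```

In a pipeline of the kind the abstract describes, n-gram counts like these would serve as the feature vector passed to a random forest classifier; reducing the alphabet shrinks the feature space from 20^n possible n-grams to k^n for a k-letter alphabet.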

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bbe4/5636826/c9efdbfb7912/41598_2017_13210_Fig1_HTML.jpg
