rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments.

Affiliation

IFM Bioinformatics, Linköping University, Linköping, Sweden.

Publication information

PLoS One. 2019 Aug 15;14(8):e0220182. doi: 10.1371/journal.pone.0220182. eCollection 2019.

DOI:10.1371/journal.pone.0220182

PMID:31415569

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6695225/

Abstract

In the last decades, huge efforts have been made in the bioinformatics community to develop machine learning-based methods for the prediction of structural features of proteins, in the hope of answering fundamental questions about the way proteins function and their involvement in several illnesses. The recent advent of Deep Learning has renewed the interest in neural networks, with dozens of methods being developed to take advantage of these new architectures. However, most methods still rely heavily on pre-processing of the input data, as well as on the extraction and integration of multiple hand-picked, manually designed features. Multiple Sequence Alignments (MSA) are the most common source of information in de novo prediction methods. Deep Networks that automatically refine the MSA and extract useful features from it would be immensely powerful. In this work, we propose a new paradigm for the prediction of protein structural features called rawMSA. The core idea behind rawMSA is borrowed from the field of natural language processing: amino acid sequences are mapped into an adaptively learned continuous space. This allows the whole MSA to be input into a Deep Network, thus rendering pre-calculated features such as sequence profiles and other features calculated from MSA obsolete. We showcased the rawMSA methodology on three different prediction problems: secondary structure, relative solvent accessibility and inter-residue contact maps. We have rigorously trained and benchmarked rawMSA on a large set of proteins and have determined that it outperforms classical methods based on position-specific scoring matrices (PSSM) when predicting secondary structure and solvent accessibility, while performing on par with methods using more pre-calculated features in the inter-residue contact map prediction category in CASP12 and CASP13. This demonstrates that rawMSA is a promising development that can pave the way for improved methods using rawMSA instead of sequence profiles to represent evolutionary information in the coming years.

Availability: datasets, dataset generation code, evaluation code and models are available at: https://bitbucket.org/clami66/rawmsa.
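The embedding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the alphabet ordering, the embedding dimension, the helper names `encode_msa` and `embed`, and the randomly initialised table are all assumptions made for illustration. In a trained network the table would be a learned parameter rather than random values.

```python
import numpy as np

# Hypothetical alphabet: 20 standard amino acids, "X" for unknown, "-" for gaps.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX-"
TOKEN = {aa: i for i, aa in enumerate(ALPHABET)}


def encode_msa(msa):
    """Map an MSA (list of equal-length aligned strings) to an integer array."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    unknown = TOKEN["X"]
    return np.array(
        [[TOKEN.get(aa, unknown) for aa in seq.upper()] for seq in msa],
        dtype=np.int64,
    )


def embed(tokens, table):
    """Look up one vector per residue: (depth, length) -> (depth, length, dim)."""
    return table[tokens]


rng = np.random.default_rng(0)
embedding_dim = 8  # assumed size, chosen only for this example
# Stand-in for a learned embedding matrix (one row per alphabet symbol).
table = rng.normal(size=(len(ALPHABET), embedding_dim))

msa = ["MKV-LA", "MRVQLA", "MKI-LS"]  # toy alignment: 3 sequences, 6 columns
tokens = encode_msa(msa)
features = embed(tokens, table)
print(tokens.shape, features.shape)  # (3, 6) (3, 6, 8)
```

The point of the sketch is that the raw alignment itself, rather than a profile summarising it, becomes the network input: every residue or gap in every aligned sequence receives its own vector, and identical symbols share the same vector.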





