End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman.

Affiliations

NSF-Simons Center for the Mathematical and Statistical Analysis of Biology, Harvard University, Cambridge, MA 02138, USA.

Department of Mathematics, University of California Berkeley, Berkeley, CA 94720, USA.

Publication information

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac724.

DOI:10.1093/bioinformatics/btac724

PMID:36355460

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9805565/

Abstract

MOTIVATION

Multiple sequence alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for.

RESULTS

Here, we implement a smooth and differentiable version of the Smith-Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF learns MSAs that mildly improve contact prediction on a diverse set of protein and RNA families. As a proof of concept, we demonstrate that by connecting our differentiable alignment module to AlphaFold2 and maximizing predicted confidence, we can learn MSAs that improve structure predictions over the initial MSAs. Interestingly, the alignments that improve AlphaFold predictions are self-inconsistent and can be viewed as adversarial. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment and the potential dangers of optimizing predictions of protein sequences with methods that are not fully understood.
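The idea of a smooth Smith-Waterman recursion can be illustrated with a small sketch. This is not the SMURF implementation: it assumes a linear gap penalty, a simple match/mismatch similarity matrix, and replaces every `max` in the classic recursion with a temperature-scaled log-sum-exp. All function names and parameters (`smooth_max`, `sw_score`, `expected_alignment`, `temp`, `gap`) are hypothetical and chosen only for illustration. The gradient of the smooth score with respect to the similarity matrix, estimated here by finite differences rather than automatic differentiation, can be read as a soft alignment.

```python
import math


def smooth_max(values, temp):
    """Temperature-scaled log-sum-exp; approaches max(values) as temp -> 0."""
    m = max(values)  # subtract the max for numerical stability
    return m + temp * math.log(sum(math.exp((v - m) / temp) for v in values))


def similarity(a, b, match=1.0, mismatch=-1.0):
    """Illustrative similarity matrix: +1 for identical symbols, -1 otherwise."""
    return [[match if x == y else mismatch for y in b] for x in a]


def sw_score(sim, gap=1.0, temp=None):
    """Local alignment score with a linear gap penalty (an assumption here).

    temp=None uses the classic hard max; a positive temp uses smooth_max,
    which makes the score a differentiable function of `sim`.
    """
    reduce_ = max if temp is None else (lambda vs: smooth_max(vs, temp))
    n, m = len(sim), len(sim[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    cells = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = reduce_([
                0.0,                                # start a new local alignment
                H[i - 1][j - 1] + sim[i - 1][j - 1],  # align position i with j
                H[i - 1][j] - gap,                  # gap in the second sequence
                H[i][j - 1] - gap,                  # gap in the first sequence
            ])
            cells.append(H[i][j])
    return reduce_(cells)  # best (or smoothed best) score over all end cells


def expected_alignment(sim, gap=1.0, temp=1.0, eps=1e-5):
    """Central finite-difference gradient of the smooth score w.r.t. sim.

    Each entry estimates how strongly pair (i, j) contributes to the score.
    """
    grad = []
    for i in range(len(sim)):
        row = []
        for j in range(len(sim[0])):
            orig = sim[i][j]
            sim[i][j] = orig + eps
            up = sw_score(sim, gap, temp)
            sim[i][j] = orig - eps
            down = sw_score(sim, gap, temp)
            sim[i][j] = orig  # restore the entry before moving on
            row.append((up - down) / (2 * eps))
        grad.append(row)
    return grad


if __name__ == "__main__":
    sim = similarity("ACGT", "ACGT")
    print("hard score:", sw_score(sim))
    print("smooth score (temp=0.001):", sw_score(sim, temp=0.001))
    for row in expected_alignment(sim, temp=1.0):
        print([round(g, 3) for g in row])
```

At low temperature the smooth score stays close to the classic Smith-Waterman score, while higher temperatures spread the gradient over many candidate alignments. In an end-to-end setting the similarity matrix would come from a learned model and the gradient from automatic differentiation.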

AVAILABILITY AND IMPLEMENTATION

Our code and examples are available at: https://github.com/spetti/SMURF.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

DeepECA: an end-to-end learning framework for protein contact prediction from a multiple sequence alignment.

BMC Bioinformatics. 2020 Jan 9;21(1):10. doi: 10.1186/s12859-019-3190-x.

High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features.

Bioinformatics. 2018 Oct 1;34(19):3308-3315. doi: 10.1093/bioinformatics/bty341.

Protein multiple sequence alignment benchmarking through secondary structure prediction.

Bioinformatics. 2017 May 1;33(9):1331-1337. doi: 10.1093/bioinformatics/btw840.

Highly significant improvement of protein sequence alignments with AlphaFold2.

Bioinformatics. 2022 Nov 15;38(22):5007-5011. doi: 10.1093/bioinformatics/btac625.

DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins.

Bioinformatics. 2020 Apr 1;36(7):2105-2112. doi: 10.1093/bioinformatics/btz863.

Identifying functionally informative evolutionary sequence profiles.

Bioinformatics. 2018 Apr 15;34(8):1278-1286. doi: 10.1093/bioinformatics/btx779.

Improved structure-related prediction for insufficient homologous proteins using MSA enhancement and pre-trained language model.

Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad217.

Seq-SetNet: directly exploiting multiple sequence alignment for protein secondary structure prediction.

Bioinformatics. 2022 Jan 27;38(4):990-996. doi: 10.1093/bioinformatics/btab777.

Improving deep learning-based protein distance prediction in CASP14.

Bioinformatics. 2021 Oct 11;37(19):3190-3196. doi: 10.1093/bioinformatics/btab355.

Cited by

Phyloformer: Fast, Accurate, and Versatile Phylogenetic Reconstruction with Deep Neural Networks.

Mol Biol Evol. 2025 Apr 1;42(4). doi: 10.1093/molbev/msaf051.

Exploring Evolution to Uncover Insights Into Protein Mutational Stability.

Mol Biol Evol. 2025 Jan 6;42(1). doi: 10.1093/molbev/msae267.

PROTA: A Robust Tool for Protamine Prediction Using a Hybrid Approach of Machine Learning and Deep Learning.

Int J Mol Sci. 2024 Sep 24;25(19):10267. doi: 10.3390/ijms251910267.

learnMSA2: deep protein multiple alignments with large language and hidden Markov models.

Bioinformatics. 2024 Sep 1;40(Suppl 2):ii79-ii86. doi: 10.1093/bioinformatics/btae381.

Differentiable phylogenetics hyperbolic embeddings with Dodonaphy.

Bioinform Adv. 2024 Jun 19;4(1):vbae082. doi: 10.1093/bioadv/vbae082. eCollection 2024.

Improved protein complex prediction with AlphaFold-multimer by denoising the MSA profile.

PLoS Comput Biol. 2024 Jul 25;20(7):e1012253. doi: 10.1371/journal.pcbi.1012253. eCollection 2024 Jul.

Sensitive remote homology search by local alignment of small positional embeddings from protein language models.

Elife. 2024 Mar 15;12:RP91415. doi: 10.7554/eLife.91415.

Differentiable partition function calculation for RNA.

Nucleic Acids Res. 2024 Feb 9;52(3):e14. doi: 10.1093/nar/gkad1168.

PepCNN deep learning tool for predicting peptide binding residues in proteins using sequence, structural, and language model features.

Sci Rep. 2023 Nov 28;13(1):20882. doi: 10.1038/s41598-023-47624-5.

Alignment-based Protein Mutational Landscape Prediction: Doing More with Less.

Genome Biol Evol. 2023 Nov 1;15(11). doi: 10.1093/gbe/evad201.
