

learnMSA2: deep protein multiple alignments with large language and hidden Markov models.

Affiliation

Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany.

Publication Information

Bioinformatics. 2024 Sep 1;40(Suppl 2):ii79-ii86. doi: 10.1093/bioinformatics/btae381.

Abstract

MOTIVATION

For the alignment of large numbers of protein sequences, the predominant tools decide whether to align two residues using only simple prior knowledge, e.g. amino acid substitution matrices, and only part of the available data. The accuracy of state-of-the-art programs declines with decreasing sequence identity and with increasingly large numbers of sequences to align. Recently, transformer-based deep-learning models have begun to harness the vast amount of protein sequence data, yielding powerful pretrained language models whose main purpose is to generate high-dimensional numerical representations (embeddings) for individual sites that agglomerate evolutionary, structural, and biophysical information.

RESULTS

We extend the traditional profile hidden Markov model so that it takes as inputs unaligned protein sequences and the corresponding embeddings. We fit the model with gradient descent using our existing differentiable hidden Markov layer. All sequences and their embeddings are jointly aligned to a model of the protein family. We report that our upgraded HMM-based aligner, learnMSA2, combined with the ProtT5-XL protein language model, aligns on average almost 6 percentage points more columns correctly than the best amino acid-based competitor and scales well with sequence number. The relative advantage of learnMSA2 over other programs tends to be greater when the sequence identity is lower and when the number of sequences is larger. Our results strengthen the evidence of the rich information contained in protein language models' embeddings and their potential downstream impact on the field of bioinformatics.

AVAILABILITY AND IMPLEMENTATION

https://github.com/Gaius-Augustus/learnMSA (PyPI and Bioconda); evaluation: https://github.com/felbecker/snakeMSA.
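The key ingredient of the approach above is that the HMM's forward recursion is differentiable, so the model can be fit by gradient descent on the sequence log-likelihood. The following is a minimal NumPy sketch of a log-space forward algorithm illustrating this idea; it is not the actual learnMSA2 implementation (which uses a TensorFlow HMM layer and profile-HMM architecture), and the function name and interface are hypothetical.

```python
import numpy as np

def forward_log_likelihood(log_init, log_trans, log_emit):
    """Log-likelihood of one sequence under an HMM, computed with the
    forward algorithm in log space.

    Every step is a composition of additions and log-sum-exp reductions,
    hence differentiable with respect to all parameters -- which is what
    makes gradient-descent training of an HMM layer possible.

    log_init:  (S,)    log initial state probabilities
    log_trans: (S, S)  log transition probabilities, [i, j] = i -> j
    log_emit:  (T, S)  per-position log emission scores, e.g. derived
                       from amino acids and/or embedding vectors
    """
    def logsumexp(a, axis):
        # numerically stable log(sum(exp(a))) along the given axis
        m = np.max(a, axis=axis, keepdims=True)
        return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

    # alpha[j] = log P(observations so far, current state = j)
    alpha = log_init + log_emit[0]
    for t in range(1, log_emit.shape[0]):
        # marginalize over the previous state i, then add the emission term
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_emit[t]
    return logsumexp(alpha, axis=0)
```

In a framework with automatic differentiation (e.g. TensorFlow, as used by learnMSA), the negative of this quantity can directly serve as a training loss, with emission scores produced from embeddings by a learned layer.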


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/da7d/11373405/12eb35108d96/btae381f1.jpg
