The language of proteins: NLP, machine learning & protein sequences.

Author information

Ofer Dan, Brandes Nadav, Linial Michal

Affiliations

Medtronic, Inc, Israel.

The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.

Publication information

Comput Struct Biotechnol J. 2021 Mar 25;19:1750-1758. doi: 10.1016/j.csbj.2021.03.022. eCollection 2021.

DOI:10.1016/j.csbj.2021.03.022

PMID:33897979

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8050421/

Abstract

Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
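As a rough illustration of the representations the abstract names, the sketch below treats a protein as a string of amino-acid letters, splits it into overlapping k-mers, counts them as a bag-of-words, and hides one residue in the style of masked language modeling. It is not code from the reviewed paper: the function names, the `<mask>` token, and the toy sequence are all assumptions made for this example.

```python
from collections import Counter


def kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a protein sequence into overlapping k-mers (n-grams of residues)."""
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kmers(sequence: str, k: int = 3) -> Counter:
    """Bag-of-words view: count each k-mer and discard its position."""
    return Counter(kmers(sequence, k))


def mask_residue(sequence: str, position: int, mask_token: str = "<mask>"):
    """Hide one residue, as in masked language modeling.

    Returns the masked token list and the residue a model would be asked to
    predict. The mask token string is an assumption, not a fixed standard.
    """
    tokens = list(sequence)
    target = tokens[position]
    tokens[position] = mask_token
    return tokens, target


if __name__ == "__main__":
    toy_sequence = "MKTAYIAKQR"  # hypothetical sequence, for illustration only
    print(kmers(toy_sequence, 3))
    print(bag_of_kmers(toy_sequence, 2).most_common(3))
    print(mask_residue(toy_sequence, 4))
```

The bag-of-k-mers representation ignores where each k-mer occurs, whereas the masked variant keeps the full residue order so that a model can use the surrounding context to predict the hidden position.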

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8aea/8050421/e7ebfb85c74f/gr1.jpg

Similar articles

The language of proteins: NLP, machine learning & protein sequences.

Comput Struct Biotechnol J. 2021 Mar 25;19:1750-1758. doi: 10.1016/j.csbj.2021.03.022. eCollection 2021.

A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information.

Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab005.

Transfer Learning for Sentiment Analysis Using BERT Based Supervised Fine-Tuning.

Sensors (Basel). 2022 May 30;22(11):4157. doi: 10.3390/s22114157.

Impact of word embedding models on text analytics in deep learning environment: a review.

Artif Intell Rev. 2023 Feb 22:1-81. doi: 10.1007/s10462-023-10419-1.

Prediction of Stroke Outcome Using Natural Language Processing-Based Machine Learning of Radiology Report of Brain MRI.

J Pers Med. 2020 Dec 16;10(4):286. doi: 10.3390/jpm10040286.

Deep Learning for Natural Language Processing in Radiology-Fundamentals and a Systematic Review.

J Am Coll Radiol. 2020 May;17(5):639-648. doi: 10.1016/j.jacr.2019.12.026. Epub 2020 Jan 28.

Leveraging transformers-based language models in proteome bioinformatics.

Proteomics. 2023 Dec;23(23-24):e2300011. doi: 10.1002/pmic.202300011. Epub 2023 Jun 29.

Essential Elements of Natural Language Processing: What the Radiologist Should Know.

Acad Radiol. 2020 Jan;27(1):6-12. doi: 10.1016/j.acra.2019.08.010. Epub 2019 Sep 17.

Effect of tokenization on transformers for biological sequences.

Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae196.

Natural Language Processing Applications in the Clinical Neurosciences: A Machine Learning Augmented Systematic Review.

Acta Neurochir Suppl. 2022;134:277-289. doi: 10.1007/978-3-030-85292-4_32.

Cited by

Progress and trends on machine learning in proteomics during 1997-2024: a bibliometric analysis.

Front Med (Lausanne). 2025 Aug 15;12:1594442. doi: 10.3389/fmed.2025.1594442. eCollection 2025.

Pathway Analysis Interpretation in the Multi-Omic Era.

BioTech (Basel). 2025 Jul 29;14(3):58. doi: 10.3390/biotech14030058.

Target sequence-conditioned design of peptide binders using masked language modeling.

Nat Biotechnol. 2025 Aug 13. doi: 10.1038/s41587-025-02761-2.

Protein Language Model Identifies Disordered, Conserved Motifs Implicated in Phase Separation.

bioRxiv. 2025 Jul 23:2024.12.12.628175. doi: 10.1101/2024.12.12.628175.

Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference.

PLoS One. 2025 Aug 6;20(8):e0325531. doi: 10.1371/journal.pone.0325531. eCollection 2025.

Generative and Contrastive Self-Supervised Learning for Virulence Factor Identification Based on Protein-Protein Interaction Networks.

Microorganisms. 2025 Jul 10;13(7):1635. doi: 10.3390/microorganisms13071635.

Progress and challenges for the application of machine learning for neglected tropical diseases.

F1000Res. 2025 May 20;12:287. doi: 10.12688/f1000research.129064.2. eCollection 2023.

spRefine Denoises and Imputes Spatial Transcriptomics with a Reference-Free Framework Powered by Genomic Language Model.

bioRxiv. 2025 Jul 7:2025.04.22.649977. doi: 10.1101/2025.04.22.649977.

A Survey of Biological Function Prediction Methods with Focus on Natural Language Processing (NLP) and Large Language Models (LLM).

Methods Mol Biol. 2025;2941:201-225. doi: 10.1007/978-1-0716-4623-6_13.

Predicting the Pathogenicity of Human Protein Variants: Not Only a Matter of Residue Labeling.

Methods Mol Biol. 2025;2941:189-199. doi: 10.1007/978-1-0716-4623-6_12.

References

Using deep learning to annotate the protein universe.

Nat Biotechnol. 2022 Jun;40(6):932-937. doi: 10.1038/s41587-021-01179-w. Epub 2022 Feb 21.

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.

Proc Natl Acad Sci U S A. 2021 Apr 13;118(15). doi: 10.1073/pnas.2016239118.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome.

Bioinformatics. 2021 Aug 9;37(15):2112-2120. doi: 10.1093/bioinformatics/btab083.

Learning the language of viral evolution and escape.

Science. 2021 Jan 15;371(6526):284-288. doi: 10.1126/science.abd7331.

Embeddings from deep learning transfer GO annotations beyond homology.

Sci Rep. 2021 Jan 13;11(1):1160. doi: 10.1038/s41598-020-80786-0.

Evaluating Protein Transfer Learning with TAPE.

Adv Neural Inf Process Syst. 2019 Dec;32:9689-9701.

On biases of attention in scientific discovery.

Bioinformatics. 2021 Apr 1;36(22-23):5269-5274. doi: 10.1093/bioinformatics/btaa1036.

Deep learning embedder method and tool for mass spectra similarity search.

J Proteomics. 2021 Feb 10;232:104070. doi: 10.1016/j.jprot.2020.104070. Epub 2020 Dec 8.

Deep Learning in Proteomics.

Proteomics. 2020 Nov;20(21-22):e1900335. doi: 10.1002/pmic.201900335. Epub 2020 Oct 30.

Signal Peptides Generated by Attention-Based Neural Networks.

ACS Synth Biol. 2020 Aug 21;9(8):2154-2161. doi: 10.1021/acssynbio.0c00219. Epub 2020 Jul 27.
