The language of proteins: NLP, machine learning & protein sequences.

Authors

Ofer Dan, Brandes Nadav, Linial Michal

Affiliations

Medtronic, Inc, Israel.

The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.

Publication information

Comput Struct Biotechnol J. 2021 Mar 25;19:1750-1758. doi: 10.1016/j.csbj.2021.03.022. eCollection 2021.


DOI: 10.1016/j.csbj.2021.03.022
PMID: 33897979
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8050421/

Abstract

Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
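
To illustrate the k-mer and bag-of-words encodings mentioned in the abstract, here is a minimal Python sketch. The function names `kmer_counts` and `bag_of_kmers`, the choice of k = 3, and the toy sequence are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count overlapping k-mers (the protein analogue of n-grams)."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers
    # (none if the sequence is shorter than k).
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def bag_of_kmers(sequence: str, vocabulary: Sequence[str], k: int = 3) -> List[int]:
    """Encode a sequence as a fixed-length count vector over a k-mer vocabulary."""
    counts = kmer_counts(sequence, k)
    # Counter returns 0 for k-mers that do not occur in the sequence.
    return [counts[kmer] for kmer in vocabulary]


if __name__ == "__main__":
    toy_sequence = "MKVLAAMKV"  # hypothetical amino-acid string, not a real protein
    vocabulary = sorted(kmer_counts(toy_sequence, k=3))
    print(vocabulary)
    print(bag_of_kmers(toy_sequence, vocabulary, k=3))
```

Like a bag-of-words representation of text, the resulting vector ignores where each k-mer occurs in the sequence. That makes it simple to feed into standard classifiers, at the cost of discarding the positional context that contextualized embeddings aim to capture.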

Similar articles

[1]
The language of proteins: NLP, machine learning & protein sequences.

Comput Struct Biotechnol J. 2021-3-25

[2]
A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information.

Brief Bioinform. 2021-9-2

[3]
Transfer Learning for Sentiment Analysis Using BERT Based Supervised Fine-Tuning.

Sensors (Basel). 2022-5-30

[4]
Impact of word embedding models on text analytics in deep learning environment: a review.

Artif Intell Rev. 2023-2-22

[5]
Prediction of Stroke Outcome Using Natural Language Processing-Based Machine Learning of Radiology Report of Brain MRI.

J Pers Med. 2020-12-16

[6]
Deep Learning for Natural Language Processing in Radiology-Fundamentals and a Systematic Review.

J Am Coll Radiol. 2020-5

[7]
Leveraging transformers-based language models in proteome bioinformatics.

Proteomics. 2023-12

[8]
Essential Elements of Natural Language Processing: What the Radiologist Should Know.

Acad Radiol. 2019-9-17

[9]
Effect of tokenization on transformers for biological sequences.

Bioinformatics. 2024-3-29

[10]
Natural Language Processing Applications in the Clinical Neurosciences: A Machine Learning Augmented Systematic Review.

Acta Neurochir Suppl. 2022

Cited by

[1]
Progress and trends on machine learning in proteomics during 1997-2024: a bibliometric analysis.

Front Med (Lausanne). 2025-8-15

[2]
Pathway Analysis Interpretation in the Multi-Omic Era.

BioTech (Basel). 2025-7-29

[3]
Target sequence-conditioned design of peptide binders using masked language modeling.

Nat Biotechnol. 2025-8-13

[4]
Protein Language Model Identifies Disordered, Conserved Motifs Implicated in Phase Separation.

bioRxiv. 2025-7-23

[5]
Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference.

PLoS One. 2025-8-6

[6]
Generative and Contrastive Self-Supervised Learning for Virulence Factor Identification Based on Protein-Protein Interaction Networks.

Microorganisms. 2025-7-10

[7]
Progress and challenges for the application of machine learning for neglected tropical diseases.

F1000Res. 2025-5-20

[8]
spRefine Denoises and Imputes Spatial Transcriptomics with a Reference-Free Framework Powered by Genomic Language Model.

bioRxiv. 2025-7-7

[9]
A Survey of Biological Function Prediction Methods with Focus on Natural Language Processing (NLP) and Large Language Models (LLM).

Methods Mol Biol. 2025

[10]
Predicting the Pathogenicity of Human Protein Variants: Not Only a Matter of Residue Labeling.

Methods Mol Biol. 2025
