Ofer Dan, Brandes Nadav, Linial Michal
Medtronic, Inc, Israel.
The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.
Comput Struct Biotechnol J. 2021 Mar 25;19:1750-1758. doi: 10.1016/j.csbj.2021.03.022. eCollection 2021.
Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in machine learning and deep learning, NLP methods have progressed rapidly. Here, we review the successes, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit for many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges at the intersection of NLP and protein research.
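To make the classic encodings named in the abstract concrete, the following is a minimal sketch of a bag-of-words representation built from overlapping k-mers. The function name `kmer_counts` and the toy sequence are illustrative assumptions, not taken from the review, and the sketch is not meant to reproduce any specific method the authors describe.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count overlapping k-mers (the protein analogue of n-grams) in a sequence.

    Hypothetical helper for illustration; returns an empty Counter when the
    sequence is shorter than k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy amino-acid string, made up for illustration (not a real protein).
toy_sequence = "MKVLAMKVL"

# The resulting counts form a bag-of-words vector: k-mer order is discarded,
# only the number of occurrences of each k-mer is kept.
counts = kmer_counts(toy_sequence, k=3)
print(counts.most_common())
```

Because the k-mers overlap, a sequence of length n yields n - k + 1 of them; here the nine-letter toy string produces seven 3-mers, with `MKV` and `KVL` each occurring twice.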