Zhang Shuang, Fan Rui, Liu Yuti, Chen Shuang, Liu Qiao, Zeng Wanwen
College of Software, Nankai University, Tianjin 300350, China.
Department of Statistics, Stanford University, Stanford, CA 94305, USA.
Bioinform Adv. 2023 Jan 11;3(1):vbad001. doi: 10.1093/bioadv/vbad001. eCollection 2023.
Transformer-based language models, including the vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Because various biological sequences share inherent similarities with natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers, and we summarize their contributions to a wide range of bioinformatics research, from basic sequence analysis to drug discovery. While transformer-based applications in bioinformatics are diverse and multifaceted, we identify and discuss common challenges, including the heterogeneity of training data, computational expense and model interpretability, as well as opportunities in the context of bioinformatics research. We hope this review brings together the broader community of NLP researchers, bioinformaticians and biologists to foster future research and development in transformer-based language models, and to inspire novel bioinformatics applications that are unattainable by traditional methods.
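To make the connection between transformers and biological sequences concrete, the sketch below shows the scaled dot-product attention operation at the core of these models, applied to toy embeddings of a short DNA sequence. This is a minimal illustration, not code from the paper: the random embeddings stand in for a learned embedding layer, and a real transformer would derive Q, K and V from learned linear projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core transformer operation: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over each row of the score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: one embedding vector per nucleotide of a short DNA sequence.
# The random vectors are placeholders for a trained embedding layer.
rng = np.random.default_rng(0)
seq = "ACGT"
d_model = 8
X = rng.normal(size=(len(seq), d_model))

# For brevity, Q = K = V = X; a real model applies learned projections first.
out, attn = scaled_dot_product_attention(X, X, X)
print(out.shape)   # one contextualized vector per position: (4, 8)
print(attn.shape)  # pairwise attention between positions: (4, 4)
```

Each row of the attention matrix sums to one, so every output vector is a weighted mixture of all positions in the sequence, which is why transformers capture long-range dependencies that sliding-window methods miss.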
Supplementary data are available at Bioinformatics Advances online.