Benegas Gonzalo, Ye Chengzhong, Albors Carlos, Li Jianan Canal, Song Yun S
Computer Science Division, University of California, Berkeley.
Department of Statistics, University of California, Berkeley.
ArXiv. 2024 Sep 22:arXiv:2407.11435v2.
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of Natural Language Processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic Language Models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.