Distinguishing word identity and sequence context in DNA language models.

Affiliations

Biomedical Genomics, Biotechnology Center, Center for Molecular and Cellular Bioengineering, Technische Universität Dresden, Dresden, Germany.

National Center for Tumor Diseases, Partner site Dresden, German Cancer Research Center, Dresden, Germany.

Publication information

BMC Bioinformatics. 2024 Sep 13;25(1):301. doi: 10.1186/s12859-024-05869-5.

Abstract

Transformer-based large language models (LLMs) are well suited to biological sequence data because of its analogies to natural language. Complex relationships can be learned because a concept of "words" can be generated through tokenization. Trained with masked token prediction, the models learn both token sequence identity and larger sequence context. We developed methodology to interrogate model learning, which is relevant both for the interpretability of the model and for evaluating its potential for specific tasks. We used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens. To gain insight into the model's learning, we interrogated how the model performs predictions, extracted token embeddings, and defined a fine-tuning benchmarking task to predict the next tokens of different sizes without overlaps. This task evaluates foundation models without interrogating specific genome biology; it does not depend on tokenization strategies, vocabulary size, the dictionary, or the number of training parameters. Lastly, there is no leakage of information from token identity into the prediction task, which makes it particularly useful for evaluating the learning of sequence context. We discovered that the model with overlapping k-mers struggles to learn larger sequence context; instead, the learned embeddings largely represent token sequence. Still, good performance is achieved for genome-biology-inspired fine-tuning tasks. Models with overlapping tokens may be used for tasks where a larger sequence context is of less relevance but the token sequence directly represents the desired learning features. This emphasizes the need to interrogate knowledge representation in biological LLMs.
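
The abstract's key point is that with overlapping k-mer tokens, a masked token's identity can be largely reconstructed from its neighbours, so masked-token prediction rewards learning token identity rather than larger sequence context. A minimal Python sketch of this effect (not the authors' code; the function names and toy sequence are illustrative) for DNABERT-style k=6 overlapping tokenization:

```python
# Minimal sketch (assumed, not the authors' code) of DNABERT-style overlapping
# k-mer tokenization. It illustrates the leakage discussed in the abstract:
# with stride-1 overlaps, a masked token's identity is recoverable from its
# neighbours, so masked-token prediction need not learn larger sequence context.

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reconstruct_masked(tokens: list[str], masked_idx: int) -> str:
    """Recover a masked k-mer entirely from its two immediate neighbours.

    With stride-1 overlaps, left[1:] equals the first k-1 bases of the masked
    token and right[-1] equals its last base, so no wider context is needed.
    """
    left = tokens[masked_idx - 1]
    right = tokens[masked_idx + 1]
    return left[1:] + right[-1]


if __name__ == "__main__":
    seq = "ACGTACGTGGCA"                 # toy sequence, not from the paper
    tokens = kmer_tokenize(seq, k=6)
    print(tokens)                        # ['ACGTAC', 'CGTACG', 'GTACGT', ...]
    i = 3                                # pretend tokens[i] is [MASK]ed
    print(tokens[i] == reconstruct_masked(tokens, i))  # True: identity leaks
```

The benchmarking task described in the abstract avoids this by predicting next tokens of different sizes without overlaps, so success requires genuine learning of sequence context rather than copying from neighbouring tokens.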

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5681/11395559/74cbbf85d825/12859_2024_5869_Fig1_HTML.jpg
