
Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings.

Affiliations

Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan.

Publication information

Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad617.

Abstract

MOTIVATION

In recent years, pre-training with the transformer architecture has gained significant attention. While this approach has led to notable performance improvements across a variety of downstream tasks, the underlying mechanisms by which pre-trained models influence these tasks, particularly in the context of biological data, are not yet fully elucidated.

RESULTS

In this study, focusing on the pre-training on nucleotide sequences, we decompose a pre-training model of Bidirectional Encoder Representations from Transformers (BERT) into its embedding and encoding modules to analyze what a pre-trained model learns from nucleotide sequences. Through a comparative study of non-standard pre-training at both the data and model levels, we find that a typical BERT model learns to capture overlapping-consistent k-mer embeddings for its token representation within its embedding module. Interestingly, using the k-mer embeddings pre-trained on random data can yield similar performance in downstream tasks, when compared with those using the k-mer embeddings pre-trained on real biological sequences. We further compare the learned k-mer embeddings with other established k-mer representations in downstream tasks of sequence-based functional prediction. Our experimental results demonstrate that the dense representation of k-mers learned from pre-training can be used as a viable alternative to one-hot encoding for representing nucleotide sequences. Furthermore, integrating the pre-trained k-mer embeddings with simpler models can achieve competitive performance in two typical downstream tasks.
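As a concrete illustration of the representations being compared, the following sketch (plain Python with NumPy, not the authors' code) tokenizes a nucleotide sequence into overlapping k-mers and contrasts a one-hot encoding with a dense embedding lookup of the kind a BERT embedding module provides. The k-mer size, embedding dimension, and the helper name kmer_tokenize are hypothetical, and the embedding matrix is random here; in the study it would be extracted from a pre-trained model.

import numpy as np
from itertools import product

def kmer_tokenize(seq, k=3):
    # Overlapping k-mers with stride 1, as in DNABERT-style tokenization.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

k = 3  # hypothetical k-mer size; the paper evaluates several choices of k
vocab = ["".join(p) for p in product("ACGT", repeat=k)]  # all 4^k k-mers
token_to_id = {t: i for i, t in enumerate(vocab)}

seq = "ACGTAGCTAGCTTACG"
ids = np.array([token_to_id[t] for t in kmer_tokenize(seq, k)])

# One-hot baseline: each k-mer is a sparse 4^k-dimensional indicator vector.
one_hot = np.eye(len(vocab))[ids]             # shape (n_tokens, 4^k)

# Dense alternative: look the same k-mers up in an embedding matrix.
# Random here for illustration; in the study this matrix comes from the
# embedding module of a BERT model pre-trained on real or random sequences.
emb_dim = 128                                 # hypothetical dimension
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((len(vocab), emb_dim))
dense = embedding_matrix[ids]                 # shape (n_tokens, emb_dim)

print(one_hot.shape, dense.shape)             # (14, 64) (14, 128)

Either matrix could then be fed to a simpler downstream model (e.g., a small CNN or MLP) for sequence-based functional prediction, which is the kind of integration with simpler models the abstract describes.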

AVAILABILITY AND IMPLEMENTATION

The source code and associated data can be accessed at https://github.com/yaozhong/bert_investigation.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a08/10612406/03b98e79cbd3/btad617f1.jpg
