Lindsey LeAnn M, Pershing Nicole L, Habib Anisa, Dufault-Thompson Keith, Stephens W Zac, Blaschke Anne J, Jiang Xiaofang, Sundar Hari
Kahlert School of Computing, University of Utah, Salt Lake City, UT, USA.
National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
bioRxiv. 2025 Jul 26:2024.09.09.612081. doi: 10.1101/2024.09.09.612081.
Genomic language models have recently emerged as a new method to decode, interpret, and generate genetic sequences. Existing genomic language models have used various tokenization methods, including character tokenization, overlapping and non-overlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic sequences differ from natural language because of their low character variability, complex and overlapping features, and inconsistent directionality. These features make sub-word tokenization in genomic language models significantly different from both traditional language models and protein language models. This study explores the impact of tokenization in genomic language models by evaluating their downstream performance on forty-four classification fine-tuning tasks. We also perform a direct comparison of byte-pair encoding and character tokenization in Mamba, a state-space model. Our results indicate that character tokenization outperforms sub-word tokenization methods on tasks that rely on nucleotide-level resolution, such as splice site prediction and promoter detection. While byte-pair encoding had stronger performance on the SARS-CoV-2 variant classification task, we observed limited statistically significant differences between tokenization methods on the remaining downstream tasks.
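To make the tokenization schemes named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. The function names (`char_tokenize`, `kmer_tokenize`, `bpe_tokenize`) and the toy sequences are hypothetical, and the byte-pair encoding routine is a simplified version that learns merges on the same sequence it tokenizes, whereas a real tokenizer learns its merge table from a training corpus.

```python
from collections import Counter


def char_tokenize(seq):
    """Character tokenization: one token per nucleotide."""
    return list(seq)


def kmer_tokenize(seq, k, overlapping=True):
    """k-mer tokenization with stride 1 (overlapping) or stride k (non-overlapping).

    Assumption for this sketch: a trailing fragment shorter than k is dropped.
    """
    stride = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def bpe_tokenize(seq, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent adjacent pair."""
    tokens = list(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges add nothing
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


if __name__ == "__main__":
    print(char_tokenize("ACGT"))
    print(kmer_tokenize("ACGTAC", 3))
    print(kmer_tokenize("ACGTAC", 3, overlapping=False))
    print(bpe_tokenize("ACACACGT", 2))
```

In this sketch, character tokens and overlapping k-mers keep a token boundary at every nucleotide position, while non-overlapping k-mers and byte-pair encoding produce variable or coarser boundaries. That difference is one plausible reading of why the abstract reports an advantage for character tokenization on tasks that need nucleotide-level resolution.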