The Impact of Tokenizer Selection in Genomic Language Models.

Author information

Lindsey LeAnn M, Pershing Nicole L, Habib Anisa, Dufault-Thompson Keith, Stephens W Zac, Blaschke Anne J, Jiang Xiaofang, Sundar Hari

Affiliations

Kahlert School of Computing, University of Utah, SLC, UT, USA.

National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Publication information

bioRxiv. 2025 Jul 26:2024.09.09.612081. doi: 10.1101/2024.09.09.612081.

DOI:10.1101/2024.09.09.612081
PMID:40777424
Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12330496/
Abstract

Genomic language models have recently emerged as a new method to decode, interpret, and generate genetic sequences. Existing genomic language models have utilized various tokenization methods, including character tokenization, overlapping and non-overlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic sequences differ from natural language because of their low character variability, complex and overlapping features, and inconsistent directionality. These features make sub-word tokenization in genomic language models significantly different from both traditional language models and protein language models. This study explores the impact of tokenization in genomic language models by evaluating their downstream performance on forty-four classification fine-tuning tasks. We also perform a direct comparison of byte-pair encoding and character tokenization in Mamba, a state-space model. Our results indicate that character tokenization outperforms sub-word tokenization methods on tasks that rely on nucleotide-level resolution, such as splice site prediction and promoter detection. While byte-pair tokenization had stronger performance on the SARS-CoV-2 variant classification task, we observed limited statistically significant differences between tokenization methods on the remaining downstream tasks.
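
To make the tokenization schemes named in the abstract concrete, the following is a minimal Python sketch of character, overlapping k-mer, non-overlapping k-mer, and byte-pair-style tokenization on a toy DNA string. It is an illustration only, not the authors' implementation: the function names (`char_tokenize`, `kmer_tokenize`, `bpe_tokenize`), the choice to drop a trailing partial k-mer, and the greedy single-sequence merge loop are all assumptions made for this example.

```python
from collections import Counter


def char_tokenize(seq):
    """Character tokenization: one token per nucleotide."""
    return list(seq)


def kmer_tokenize(seq, k, overlapping=True):
    """k-mer tokenization with stride 1 (overlapping) or stride k (non-overlapping).

    Assumption for this sketch: a trailing fragment shorter than k is dropped.
    """
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def bpe_tokenize(seq, num_merges):
    """Toy byte-pair-style tokenization learned on a single sequence.

    Starts from characters and repeatedly merges the most frequent adjacent
    pair. Real tokenizers learn merges on a corpus and store a vocabulary;
    this hypothetical helper only illustrates the merge step.
    """
    tokens = list(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), _count = pairs.most_common(1)[0]
        merged = []
        i = 0
        while i < len(tokens):
            # Merge each non-overlapping occurrence of the chosen pair.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


if __name__ == "__main__":
    dna = "ACGTACGT"
    print(char_tokenize(dna))
    print(kmer_tokenize(dna, 3, overlapping=True))
    print(kmer_tokenize(dna, 3, overlapping=False))
    print(bpe_tokenize(dna, 2))
```

Character tokens keep single-nucleotide resolution at the cost of longer token sequences, while k-mer and byte-pair tokens shorten the sequence but can place a boundary of interest, such as a splice site, inside a token.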

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6b7/12330496/9a3e4a019b84/nihpp-2024.09.09.612081v3-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6b7/12330496/caf877342e08/nihpp-2024.09.09.612081v3-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6b7/12330496/6a2bfcf03935/nihpp-2024.09.09.612081v3-f0002.jpg
