Akotenou Genereux, El Allali Achraf
Bioinformatics Laboratory, College of Computing, University Mohammed VI Polytechnic, Lot 660, Hay Moulay Rachid, Ben Guerir 43150, Morocco.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf311.
Accurate bacterial gene prediction is essential for understanding microbial functions and advancing biotechnology. Traditional methods based on sequence homology and statistical models often struggle with complex genetic variations and novel sequences because of their limited ability to interpret the "language of genes." To overcome these challenges, we explore genomic language models (gLMs), inspired by large language models (LLMs) in natural language processing, to enhance bacterial gene prediction. These models learn patterns and contextual dependencies within genetic sequences, much as LLMs do for human language. Our approach, GeneLM, uses a transformer model, DNABERT, in a two-stage framework: it first identifies coding sequence (CDS) regions and then refines these predictions by identifying the correct translation initiation sites (TIS). DNABERT is fine-tuned on a curated set of NCBI complete bacterial genomes, with sequences processed by a k-mer tokenizer. Our results show that GeneLM significantly improves gene prediction accuracy. Compared with the leading prokaryotic gene finders (Prodigal, GeneMark-HMM, and Glimmer) and with other recent deep learning methods, GeneLM reduces missed CDS predictions while increasing matched annotations. More notably, its TIS predictions outperform those of traditional methods when tested against experimentally verified sites. GeneLM demonstrates the power of gLMs in decoding genetic information, achieving state-of-the-art performance in bacterial genome analysis. This advance highlights the potential of language models to transform genome annotation, outperforming conventional tools and enabling more precise genetic insights.
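The k-mer tokenization and the two-stage flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of start and stop codons, the flank size, the score threshold, and the two scoring callables (`cds_score`, `tis_score`, which stand in for fine-tuned DNABERT classifiers) are all assumptions made for illustration, and only the forward strand is scanned.

```python
from typing import Callable, List, Tuple

# Assumption: common bacterial start/stop codons; the abstract does not list them.
START_CODONS = ("ATG", "GTG", "TTG")
STOP_CODONS = ("TAA", "TAG", "TGA")


def kmer_tokenize(seq: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def candidate_orfs(seq: str) -> List[Tuple[List[int], int]]:
    """Forward-strand ORF candidates grouped by stop codon.

    Each item is (in-frame start positions, end index just past the stop codon).
    """
    seq = seq.upper()
    groups = []
    for frame in range(3):
        starts: List[int] = []
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in START_CODONS:
                starts.append(i)
            elif codon in STOP_CODONS:
                if starts:
                    groups.append((starts, i + 3))
                starts = []
    return groups


def predict_genes(
    seq: str,
    cds_score: Callable[[List[str]], float],  # hypothetical stage-1 model
    tis_score: Callable[[List[str]], float],  # hypothetical stage-2 model
    k: int = 6,
    flank: int = 30,        # assumed context window around each start codon
    threshold: float = 0.5,  # assumed stage-1 decision threshold
) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for ORFs accepted by both stages."""
    genes = []
    for starts, end in candidate_orfs(seq):
        # Stage 1: decide whether the longest ORF for this stop looks coding.
        if cds_score(kmer_tokenize(seq[starts[0]:end], k)) < threshold:
            continue
        # Stage 2: pick the start whose flanking window scores highest.
        best = max(
            starts,
            key=lambda s: tis_score(
                kmer_tokenize(seq[max(0, s - flank):s + 3 + flank], k)
            ),
        )
        genes.append((best, end))
    return genes
```

In practice the two callables would wrap model inference on the tokenized windows; here they are left abstract so that the control flow of the two stages is visible on its own.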