HyenaCircle: a HyenaDNA-based pretrained large language model for long eccDNA prediction.

Authors

Li Fuyu, Lu Wenxiang, Bai Yunfei

Affiliation

State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China.

Publication information

Front Genet. 2025 Jun 26;16:1641162. doi: 10.3389/fgene.2025.1641162. eCollection 2025.

Abstract

INTRODUCTION

Extrachromosomal circular DNA (eccDNA) is a class of chromosome-derived circular DNA molecules with diverse roles in disease. Long eccDNAs (typically 1-5 kb) are difficult to detect because of their size, which hinders functional studies. We propose HyenaCircle, a novel deep learning model that leverages a large language model and third-generation sequencing data to predict long eccDNA formation.

METHODS

Full-length eccDNAs of 1-5 kb were identified from Nanopore sequencing data with the FLED algorithm, extended with 100-bp flanking sequences, and paired with 20,000 length-matched negative controls drawn from eccDNA-depleted genomic regions. HyenaCircle was built by adapting the pretrained HyenaDNA model with a custom classifier head. Data augmentation, regularization, and class-imbalance weighting were applied to increase model robustness.
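The flank extension and class-imbalance weighting described above can be illustrated with a minimal sketch. The abstract does not say whether the 100 bp is added on each side or which weighting formula was used, so the code below assumes a 100-bp flank on both sides (clipped to the chromosome) and the common inverse-frequency ("balanced") heuristic. The function names `extend_interval` and `balanced_class_weights` are hypothetical and are not taken from the paper.

```python
from collections import Counter

FLANK_BP = 100  # flank length stated in the abstract


def extend_interval(start, end, chrom_len, flank=FLANK_BP):
    """Extend a 0-based half-open interval [start, end) by `flank` bp per side.

    Assumption: the flank is added on both sides and clipped to the
    chromosome boundaries; the abstract does not specify this.
    """
    return max(0, start - flank), min(chrom_len, end + flank)


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: this is one common weighting heuristic, not necessarily
    the scheme used by the authors.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


if __name__ == "__main__":
    # A 1,950-bp eccDNA near the start of a toy 2,050-bp contig.
    print(extend_interval(50, 2000, chrom_len=2050))
    # One positive for every three negatives: the rarer class gets more weight.
    print(balanced_class_weights([1, 0, 0, 0]))
```

Weights of this kind are typically passed to the loss function so that errors on the rarer class count more during training.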

RESULTS

HyenaCircle achieved comparable performance, with a validation AUROC of 0.715 and a recall of 0.776. It surpassed DNABERT by 5.9% in AUROC and demonstrated stable convergence. Hyperparameter optimization identified a batch size of 16 and a learning rate of 5 × 10 as optimal. Ablation studies revealed that flanking sequences are important, as their removal reduced model stability. The model also showed superior stability over the baseline HyenaDNA architecture.
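For readers unfamiliar with the two metrics reported above, the following is a small, dependency-free sketch of how AUROC and recall are usually computed from labels and predicted scores. The toy labels and scores are invented for illustration and are not the study's data. The 0.5 decision threshold for recall is an assumption, since the abstract does not state the threshold used.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative (ties count as half), i.e. the Mann-Whitney U formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall(labels, scores, threshold=0.5):
    """Fraction of true positives recovered at the given score threshold.

    Assumption: 0.5 is an illustrative default, not the paper's setting.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp + fn == 0:
        raise ValueError("recall is undefined without positive examples")
    return tp / (tp + fn)


if __name__ == "__main__":
    toy_labels = [1, 1, 0, 0]          # invented example data
    toy_scores = [0.9, 0.4, 0.6, 0.1]  # invented example data
    print(auroc(toy_labels, toy_scores))
    print(recall(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a rank-based implementation would be preferable for large validation sets.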

CONCLUSION

HyenaCircle integrates third-generation sequencing data with a large language model for long eccDNA prediction and outperforms the existing model. Our work demonstrates that the HyenaDNA architecture enables effective long-sequence genomic modeling and offers new insight into eccDNA prediction and identification.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6f79/12240936/17fec3c189fc/fgene-16-1641162-g001.jpg
