Li Fuyu, Lu Wenxiang, Bai Yunfei
State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China.
Front Genet. 2025 Jun 26;16:1641162. doi: 10.3389/fgene.2025.1641162. eCollection 2025.
Extrachromosomal circular DNA (eccDNA) is a class of chromosome-derived circular DNA molecules with diverse roles in disease. Long eccDNAs (typically 1-5 kb) are difficult to detect because of their size, which hinders functional studies. We propose HyenaCircle, a novel deep learning model that leverages a large language model and third-generation sequencing data to predict long eccDNA formation.
Full-length eccDNAs of 1-5 kb were identified from Nanopore sequencing data with the FLED algorithm, extended by 100-bp flanking sequences, and paired with 20,000 length-matched negative controls drawn from eccDNA-depleted genomic regions. HyenaCircle was built by adapting the pretrained HyenaDNA model with a custom classifier head. Data augmentation, regularization, and class-imbalance weighting were applied to improve model robustness.
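The preprocessing steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline. The function names, the 0-based half-open coordinate convention, the use of reverse complementation as the augmentation, and the inverse-frequency weighting formula are all assumptions made for illustration. Only the 100-bp flank length comes from the text.

```python
from collections import Counter

FLANK = 100  # flank length in bp, as stated in the abstract


def extend_interval(start, end, chrom_len, flank=FLANK):
    """Extend a 0-based half-open interval by `flank` bp per side,
    clamped to the chromosome bounds (coordinate convention is assumed)."""
    return max(0, start - flank), min(chrom_len, end + flank)


def reverse_complement(seq):
    """Reverse-complement a DNA string; one plausible augmentation
    (assumed here; the abstract does not name the augmentation used)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def class_weights(labels):
    """Inverse-frequency class weights, n / (k * count), so that the
    rarer class contributes more to the loss (assumed formula)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}
```

Weights of this form could be passed to a weighted cross-entropy loss during fine-tuning. Whether HyenaCircle uses this exact scheme is not stated in the abstract.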
HyenaCircle achieved a validation AUROC of 0.715 and a recall of 0.776. It surpassed DNABERT by 5.9% in AUROC and demonstrated stable convergence. Hyperparameter optimization confirmed batch size 16 and learning rate 5 × 10 as optimal. Ablation studies showed that the flanking sequences are important, as removing them reduced model stability. The model was also more stable than the baseline HyenaDNA architecture.
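For readers unfamiliar with the two reported metrics, the following sketch shows how AUROC and recall can be computed from labels and predicted scores. It is a generic illustration, not the authors' evaluation code. The function names and the 0.5 decision threshold are assumptions.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive scores higher
    than a random negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall(labels, scores, threshold=0.5):
    """Fraction of true positives recovered at the given score threshold
    (the 0.5 default is an assumed value)."""
    pos_scores = [s for y, s in zip(labels, scores) if y == 1]
    if not pos_scores:
        raise ValueError("recall needs at least one positive example")
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)
```

AUROC is threshold-free, while recall depends on the chosen cutoff, so the two numbers answer different questions about the classifier.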
HyenaCircle integrates third-generation sequencing data with a large language model for long eccDNA prediction and outperformed the existing model. Our work demonstrates that the HyenaDNA architecture enables effective long-sequence genomic modeling and provides new insight into eccDNA prediction and identification.