HyenaCircle: a HyenaDNA-based pretrained large language model for long eccDNA prediction.

Author information

Li Fuyu, Lu Wenxiang, Bai Yunfei

Affiliation

State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China.

Publication information

Front Genet. 2025 Jun 26;16:1641162. doi: 10.3389/fgene.2025.1641162. eCollection 2025.

DOI:10.3389/fgene.2025.1641162

PMID:40641599

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12240936/

Abstract

INTRODUCTION

Extrachromosomal circular DNA (eccDNA) is a class of chromosome-derived circular DNA molecules with diverse roles in disease. Long eccDNAs (typically 1-5 kb) are difficult to detect because of their size, which hinders functional studies. We propose HyenaCircle, a novel deep learning model that leverages a large language model and third-generation sequencing data to predict long eccDNA formation.

METHODS

Full-length eccDNAs of 1-5 kb were identified from Nanopore sequencing data with the FLED algorithm, extended by 100-bp flanking sequences, and paired with 20,000 length-matched negative controls drawn from eccDNA-depleted genomic regions. HyenaCircle was built by adapting the pretrained HyenaDNA model with a custom classifier head. Data augmentation, regularization, and class-imbalance weighting were applied to increase model robustness.
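The two preprocessing ideas named above, flank extension and class-imbalance weighting, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the 0-based half-open coordinate convention, the choice to extend on both sides, the clamping at chromosome ends, and the inverse-frequency weighting formula are all assumptions made for illustration; the abstract only states the 100-bp flank length and that class weighting was used.

```python
from collections import Counter

FLANK_BP = 100  # flank length stated in the abstract


def extend_with_flanks(start, end, chrom_length, flank=FLANK_BP):
    """Extend a 0-based half-open interval [start, end) by `flank` bp per side.

    Assumption: flanks are added on both sides and clamped to the
    chromosome boundaries; the abstract does not specify either detail.
    """
    if not 0 <= start < end <= chrom_length:
        raise ValueError("interval must lie within the chromosome")
    return max(0, start - flank), min(chrom_length, end + flank)


def inverse_frequency_weights(labels):
    """Return per-class weights n_samples / (n_classes * class_count).

    Assumption: a common inverse-frequency heuristic, used here only as
    a stand-in for whatever weighting the authors applied.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


if __name__ == "__main__":
    # A hypothetical 2 kb eccDNA locus on a hypothetical 10 kb contig.
    print(extend_with_flanks(1000, 3000, 10_000))
    # One positive for every three negatives: the rare class gets the larger weight.
    print(inverse_frequency_weights([1, 0, 0, 0]))
```

Weights of this kind are typically passed to a weighted cross-entropy loss so that errors on the rarer class count for more during training.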

RESULTS

HyenaCircle achieved a validation AUROC of 0.715 and a recall of 0.776. It surpassed DNABERT by 5.9% in AUROC and demonstrated stable convergence. Hyperparameter optimization confirmed batch size 16 and learning rate 5 × 10 as optimal. Ablation studies showed that the flanking sequences are important: removing them reduced model stability. The model was also more stable than the baseline HyenaDNA architecture.
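For readers unfamiliar with the two metrics reported here, the following Python sketch shows how AUROC and recall are conventionally computed from labels and predicted scores. The function names, the toy data, and the 0.5 decision threshold are assumptions for illustration; the abstract does not state which threshold the reported recall used.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive scores above a
    random negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def recall(labels, scores, threshold=0.5):
    """Fraction of true positives whose score reaches the threshold.

    Assumption: 0.5 is an illustrative default, not the paper's setting.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    if not positives:
        raise ValueError("need at least one positive")
    return sum(1 for s in positives if s >= threshold) / len(positives)


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    y_true = [1, 1, 1, 0, 0]
    y_score = [0.9, 0.6, 0.3, 0.4, 0.2]
    print(auroc(y_true, y_score), recall(y_true, y_score))
```

AUROC is threshold-free, while recall depends on the chosen cutoff, which is why the two numbers are reported separately.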

CONCLUSION

HyenaCircle integrates third-generation sequencing data with a large language model for long eccDNA prediction and outperforms the existing model. Our work demonstrates that the HyenaDNA architecture enables effective long-sequence genomic modeling and offers new insight into eccDNA prediction and identification.

