PCVR: a pre-trained contextualized visual representation for DNA sequence classification.

Authors

Zhou Jiarui, Wu Hui, Du Kang, Zhou Wengang, Zhou Cong-Zhao, Li Houqiang

Affiliations

School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China.

Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China.

Publication information

BMC Bioinformatics. 2025 May 9;26(1):125. doi: 10.1186/s12859-025-06136-x.

Abstract

BACKGROUND

The classification of DNA sequences is pivotal in bioinformatics and essential for genetic information analysis. Traditional alignment-based tools tend to be slow and to have low recall. Machine learning methods learn implicit patterns from data using encoding techniques such as k-mer counting and ordinal encoding, which either fail to handle long sequences or sacrifice structural and sequential information. Frequency chaos game representation (FCGR) converts DNA sequences of arbitrary length into fixed-size images, removing the constraint of sequence length while preserving more sequential information than other representations. However, existing work considers only local information, ignoring long-range dependencies and global contextual information within the FCGR image.
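
As a rough illustration of how FCGR maps a sequence of any length to a fixed-size image, the sketch below counts k-mers on a 2^k × 2^k grid. It is a minimal sketch, not the PCVR implementation: the function name `fcgr`, the corner assignment in `CORNERS`, and the choice to skip k-mers containing ambiguous bases are all assumptions made for the example.

```python
# Corner assignment is one common convention (an assumption here); others exist.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=3):
    """Return a 2^k x 2^k grid of k-mer counts placed by chaos game coordinates."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers with ambiguous bases such as N
        x = y = 0
        # The last base of the k-mer selects the coarsest quadrant, so it
        # must contribute the most significant bit of each coordinate.
        for base in reversed(kmer):
            bx, by = CORNERS[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y][x] += 1
    return grid


# Sequences of different lengths yield grids of the same shape.
short_img = fcgr("ACGTACGT", k=2)
long_img = fcgr("ACGT" * 50, k=2)
```

Both `short_img` and `long_img` are 4 × 4 grids; only the counts differ, which is what lets a fixed-input image model consume sequences of arbitrary length.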

RESULTS

We propose PCVR, a Pre-trained Contextualized Visual Representation for DNA sequence classification. PCVR encodes FCGR with a vision transformer into contextualized features containing more global information. To meet the substantial data requirements of vision transformer training and to learn more robust features, we pre-train the encoder with a masked autoencoder. Pre-trained PCVR exhibits impressive performance on three datasets even with only unsupervised learning. After fine-tuning, PCVR outperforms existing methods at the superkingdom and phylum levels. Additionally, our ablation studies confirm the contribution of the vision transformer encoder and masked autoencoder pre-training to performance improvement.
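
The masking step at the heart of masked autoencoder pre-training can be sketched as follows: the image is split into non-overlapping patches, a random subset is hidden, and only the visible patches would be passed to the encoder. This is a simplified sketch under stated assumptions, not the authors' code: the helper names `patchify` and `random_mask`, the patch size, and the 0.75 mask ratio are illustrative choices, and the abstract does not state the values PCVR uses.

```python
import random


def patchify(image, patch):
    """Split a square 2D list into flattened patch x patch blocks, row-major."""
    n = len(image)
    assert n % patch == 0, "image side must be divisible by the patch size"
    patches = []
    for r in range(0, n, patch):
        for c in range(0, n, patch):
            block = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            patches.append(block)
    return patches


def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Return (visible, masked) patch indices; mask_ratio is an assumed value."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1 - mask_ratio))
    return sorted(order[:num_keep]), sorted(order[num_keep:])


# Toy 8 x 8 "image" standing in for an FCGR grid.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patches = patchify(image, patch=2)           # 16 patches of 4 values each
visible, masked = random_mask(len(patches), seed=0)
encoder_input = [patches[i] for i in visible]
```

In a full masked autoencoder, a decoder would then reconstruct the masked patches from the encoded visible ones, and the reconstruction error on the masked patches would drive pre-training.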

CONCLUSIONS

PCVR significantly improves DNA sequence classification accuracy and, owing to its robustness and effective capture of global information, shows strong potential for new species discovery. Code for PCVR is available at https://github.com/jiaruizhou/PCVR.
