Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization.

Author information

Refahi Mohammadsaleh, Sokhansanj Bahrad A, Mell Joshua C, Brown James R, Yoo Hyunwoo, Hearne Gavin, Rosen Gail L

Affiliations

Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA.

College of Medicine, Drexel University, Philadelphia, PA, USA.

Publication information

Commun Biol. 2025 Mar 29;8(1):517. doi: 10.1038/s42003-025-07902-6.

Abstract

Analysis of genomic and metagenomic sequences is inherently more challenging than that of amino acid sequences due to the higher divergence among evolutionarily related nucleotide sequences, variable k-mer and codon usage within and among genomes of diverse species, and poorly understood selective constraints. We introduce Scorpio (Sequence Contrastive Optimization for Representation and Predictive Inference on DNA), a versatile framework designed for nucleotide sequences that employs contrastive learning to improve embeddings. By leveraging pre-trained genomic language models and k-mer frequency embeddings, Scorpio demonstrates competitive performance in diverse applications, including taxonomic and gene classification, antimicrobial resistance (AMR) gene identification, and promoter detection. A key strength of Scorpio is its ability to generalize to novel DNA sequences and taxa, addressing a significant limitation of alignment-based methods. Scorpio has been tested on multiple datasets with DNA sequences of varying lengths (long and short) and shows robust inference capabilities. Additionally, we provide an analysis of the biological information underlying this representation, including correlations among the codon adaptation index (a gene expression factor), sequence similarity, and taxonomy, as well as the functional and structural information of genes.
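To make the two ingredients named in the abstract more concrete, the following minimal sketch builds a k-mer frequency vector for a DNA sequence and evaluates a triplet margin loss, one common contrastive objective. The abstract does not specify Scorpio's loss function, encoder, or value of k, so the function names (`kmer_frequencies`, `triplet_loss`), the choice of k = 3, the margin, and the toy sequences are all assumptions made purely for illustration.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=3):
    """Return a normalized k-mer frequency vector over the ACGT alphabet."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with ambiguous bases (e.g. N) are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss (an assumed objective, not confirmed by the abstract).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy sequences (hypothetical): the positive differs from the anchor by one base,
# while the negative has a very different composition.
anchor = kmer_frequencies("ATGGCGTACGTTAGC")
positive = kmer_frequencies("ATGGCGTACGTAAGC")
negative = kmer_frequencies("TTTTTTTTTTTTTTT")
loss = triplet_loss(anchor, positive, negative)
```

In a trained model the distances would be computed on learned embeddings rather than on raw k-mer frequencies, and the loss would be minimized over many triplets; here the raw vectors stand in for embeddings only to show how the objective behaves.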
