Li Jing, Mi Jia, Lin Wei, Tian Fengjuan, Wan Jing, Gao Jingyang, Tong Yigang
The College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China.
The College of Life Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China.
Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf224.
Viruses are ubiquitous in nature, yet our understanding of them remains limited. High-throughput sequencing reveals the genetic composition of a sample without bias; however, viral sequences typically make up a small proportion of the sequencing data, making it challenging to accurately identify the few or fragmented viral sequences present in a sample. Because short sequences carry limited features and information, existing models resolve viral sequences poorly. We therefore propose a new model, VirNucPro, for short viral sequence identification. Using a six-frame translation strategy and large language models, we combine nucleotide and amino acid sequence information to enhance feature extraction for short sequences, achieving high accuracy in identifying short viral sequences. Ablation experiments compared the contributions of nucleotide and amino acid sequence features to the model, confirming that the introduced amino acid features contribute significantly to the classification results. Our model outperforms others, such as GCNFrame, DeepVirFinder, DETIRE, and Virtifier, which have demonstrated good performance in identifying short viral sequences of 300 and 500 bp. It also demonstrates excellent performance on carefully constructed real-world datasets. Additionally, it can scan for prophage regions within long bacterial fragments, offering a wide range of applications. The code is available at: https://github.com/Li-Jing-1997/VirNucPro.
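The six-frame translation strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`reverse_complement`, `translate`, `six_frame_translation`), the frame labels, and the use of the standard genetic code with `X` for unrecognized codons are all assumptions made for illustration. The sketch only shows how a short nucleotide read yields six candidate amino acid sequences, three per strand; how VirNucPro then feeds nucleotide and amino acid sequences to its language models is not described here.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated in
# T, C, A, G order. The choice of table is an assumption for this sketch.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> dict[str, str]:
    """Translate a DNA sequence in three forward and three reverse frames.

    Keys are hypothetical labels: '+1'..'+3' for the given strand and
    '-1'..'-3' for its reverse complement.
    """
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Stop codons are kept as `*` rather than used to truncate the translation, since a short read may cover only part of a gene and the correct reading frame is not known in advance.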