DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data.

Author information

Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, P.R. China.

Department of Bioengineering and Biotechnology, Huaqiao University, Xiamen, Fujian, P.R. China.

Publication information

PeerJ. 2022 Jun 8;10:e13404. doi: 10.7717/peerj.13404. eCollection 2022.

Abstract

Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Because universal gene markers and database representatives are lacking, functions cannot be assigned to about 50-90% of phage genes. This makes it challenging to identify phage genomes and annotate phage gene functions efficiently by large-scale homology search, especially for novel phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three proteins specific to Caudovirales phages. Here, we developed a convolutional neural network (CNN)-based framework, DeephageTP, to identify these three proteins from metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and automatically extracts predictive features during model training. To reduce false positives, we introduce a cutoff-loss-value strategy based on the distribution of the loss values of protein sequences within the same category. With a set of cutoff loss values, the proposed model achieves high precision in identifying TerL and Portal sequences (94% and 90%, respectively) in the mimic metagenomic dataset. Finally, we tested the framework on three real metagenomic datasets. The results showed that, compared with conventional alignment-based methods, our framework had a particular advantage in identifying novel Portal and TerL protein sequences with remote homology to their counterparts in the training datasets. In summary, our study is the first to develop a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP.
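
The two ideas named in the abstract, one-hot encoding of protein sequences and a per-category loss cutoff for rejecting low-confidence calls, can be illustrated with a minimal sketch. This is not the DeephageTP code: the 20-letter alphabet, the zero-padding to a fixed length, the quantile rule for choosing a cutoff, and all function names below are assumptions made for illustration only.

```python
import math

# Assumption: the 20 standard amino acids; other symbols become all-zero rows.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix; padding and unknown residues stay all zeros."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence.upper()[:max_len]):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


def cross_entropy_loss(probs, label):
    """Categorical cross-entropy of one sequence with respect to class `label`."""
    return -math.log(max(probs[label], 1e-12))


def loss_cutoff(category_losses, quantile=0.95):
    """Hypothetical rule: take a quantile of the losses observed for one category."""
    ordered = sorted(category_losses)
    k = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[k]


def classify_with_cutoff(probs, cutoffs, classes):
    """Keep the top-scoring class only if its loss does not exceed that class's cutoff."""
    label = max(range(len(probs)), key=probs.__getitem__)
    loss = cross_entropy_loss(probs, label)
    name = classes[label]
    return name if loss <= cutoffs[name] else "rejected"
```

In this sketch, `probs` stands for the softmax output of a trained classifier. A prediction whose loss falls outside the range typical of its category is discarded instead of being reported, which trades some recall for precision.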

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/097a/9188312/3b37c6bb7137/peerj-10-13404-g001.jpg