PhANNs, a fast and accurate tool and web server to classify phage structural proteins.

Affiliations

Computational Science Research Center, San Diego State University, San Diego, United States of America.

Viral Information Institute, San Diego State University, San Diego, United States of America.

Publication information

PLoS Comput Biol. 2020 Nov 2;16(11):e1007845. doi: 10.1371/journal.pcbi.1007845. eCollection 2020 Nov.

Abstract

For any given bacteriophage genome or phage-derived sequences in metagenomic data sets, we are unable to assign a function to 50-90% of genes, or more. Structural protein-encoding genes constitute a large fraction of the average phage genome and are among the most divergent and difficult-to-identify genes using homology-based methods. To understand the functions encoded by phages, their contributions to their environments, and to help gauge their utility as potential phage therapy agents, we have developed a new approach to classify phage ORFs into ten major classes of structural proteins or into an "other" category. The resulting tool is named PhANNs (Phage Artificial Neural Networks). We built a database of 538,213 manually curated phage protein sequences that we split into eleven subsets (10 for cross-validation, one for testing) using a novel clustering method that ensures there are no homologous proteins between sets yet maintains the maximum sequence diversity for training. An Artificial Neural Network ensemble trained on features extracted from those sets reached a test F1-score of 0.875 and test accuracy of 86.2%. PhANNs can rapidly classify proteins into one of the ten structural classes or, if not predicted to fall in one of the ten classes, as "other," providing a new approach for functional annotation of phage proteins. PhANNs is open source and can be run from our web server or installed locally.
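The abstract's requirement that no homologous proteins are shared between the eleven subsets can be illustrated with a generic group-aware split. The sketch below is not the authors' clustering method. It assumes each protein already carries a cluster ID from some homology clustering step, and the names `split_by_cluster` and `cluster_of` are hypothetical. It only shows how whole clusters can be assigned to folds so that no cluster spans two folds.

```python
from collections import defaultdict


def split_by_cluster(cluster_of, n_folds=11):
    """Assign proteins to folds so that no cluster is split across folds.

    cluster_of: dict mapping protein ID -> precomputed cluster ID
                (assumed input; the clustering itself is not shown here).
    n_folds:    number of subsets, e.g. 10 for cross-validation + 1 for testing.
    """
    # Group protein IDs by their cluster.
    members = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        members[cluster_id].append(protein_id)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest clusters first,
    # each into the fold that currently holds the fewest proteins.
    ordered = sorted(members, key=lambda c: (-len(members[c]), str(c)))
    for cluster_id in ordered:
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(sorted(members[cluster_id]))
    return folds


if __name__ == "__main__":
    # Toy data: six proteins in four hypothetical clusters.
    toy = {"p1": "c1", "p2": "c1", "p3": "c2",
           "p4": "c3", "p5": "c3", "p6": "c4"}
    for index, fold in enumerate(split_by_cluster(toy, n_folds=3)):
        print(index, fold)
```

Because whole clusters are moved as a unit, related sequences never appear on both sides of a train/test boundary. The greedy size balancing is a design choice of this sketch and is not taken from the paper.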

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7cbd/7660903/6c67c887a3df/pcbi.1007845.g001.jpg
