PASS: Protein Annotation Surveillance Site for Protein Annotation Using Homologous Clusters, NLP, and Sequence Similarity Networks.

Author information

Tao Jin, Brayton Kelly A, Broschat Shira L

Affiliations

School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States.

Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, WA, United States.

Publication information

Front Bioinform. 2021 Sep 29;1:749008. doi: 10.3389/fbinf.2021.749008. eCollection 2021.

DOI:10.3389/fbinf.2021.749008

PMID:36303767

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9581018/

Abstract

Advances in genome sequencing have accelerated the growth in the number of sequenced genomes, but at a cost to the quality of genome annotation. At the same time, computational analysis is widely used for protein annotation, but a dearth of experimental verification has contributed to inaccurate annotation as well as to annotation error propagation. Thus, a tool to help life scientists with accurate protein annotation would be useful. In this work we describe a website we have developed, the Protein Annotation Surveillance Site (PASS), which provides such a tool. This website consists of three major components: a database of homologous clusters of more than eight million protein sequences deduced from the representative genomes of bacteria, archaea, eukarya, and viruses, together with sequence information; a machine-learning software tool that periodically queries the UniProtKB database to determine whether protein function has been experimentally verified; and a queryable webpage where the FASTA headers of sequences from the cluster best matching an input sequence are returned. The user can choose from these sequences to create a sequence similarity network to assist in annotation, or else use their expert knowledge to choose an annotation from the cluster sequences. Illustrations demonstrating use of this website are presented.
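
For readers unfamiliar with sequence similarity networks, the general idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PASS implementation, which the abstract does not describe: the function names (`similarity`, `build_ssn`, `components`), the toy sequences, and the 0.9 threshold are all hypothetical, and `difflib.SequenceMatcher` stands in for the alignment-based scores a real tool would normally use.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity score in [0, 1] (assumption: a real network
    would use alignment-based scores rather than difflib)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def build_ssn(seqs, threshold):
    """Build an adjacency map: nodes are sequence names, and an edge joins
    two sequences whose similarity is at least `threshold`."""
    graph = {name: set() for name in seqs}
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        if similarity(s1, s2) >= threshold:
            graph[n1].add(n2)
            graph[n2].add(n1)
    return graph


def components(graph):
    """Return the connected components of the network as a list of sets."""
    seen = set()
    groups = []
    for start in graph:
        if start in seen:
            continue
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Toy input (hypothetical sequences, not taken from the PASS database).
toy_seqs = {"a": "MKTAYIAK", "b": "MKTAYIAK", "c": "GGGGSGGG"}
network = build_ssn(toy_seqs, threshold=0.9)
groups = components(network)
```

In a network like this, sequences that fall into the same connected component are candidates for sharing an annotation, while an isolated node suggests that no close relative was found at the chosen threshold.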
