Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland.
Centre Hospitalier Universitaire Vaudois/Ludwig Institute for Cancer Research, Agora Centre, CH-1005 Lausanne, Switzerland.
GigaScience. 2020 Feb 1;9(2). doi: 10.1093/gigascience/giaa003.
Genome and proteome annotation pipelines are generally custom built and not easily reusable by other groups. This leads to duplication of effort, increased costs, and suboptimal annotation quality. One way to address these issues is to encourage the adoption of annotation standards and technological solutions that enable the sharing of biological knowledge and tools for genome and proteome annotation.
Here we demonstrate one approach to generating portable genome and proteome annotation pipelines that users can run without recourse to custom software. This proof of concept combines our own rule-based annotation pipeline HAMAP, which provides functional annotation for protein sequences to the same depth and quality as UniProtKB/Swiss-Prot, with the World Wide Web Consortium (W3C) standards Resource Description Framework (RDF) and SPARQL (a recursive acronym for the SPARQL Protocol and RDF Query Language). We translate complex HAMAP rules into the W3C standard SPARQL 1.1 syntax and then apply them to protein sequences in RDF format using freely available SPARQL engines. This approach generates annotation identical to that produced by our own in-house pipeline while relying only on standard, off-the-shelf solutions, and it is applicable to any genome or proteome annotation pipeline.
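The idea of a rule applied to RDF data can be sketched in a few lines of Python. The sketch below is not a SPARQL engine and does not use the real HAMAP rules or the UniProt vocabulary: it emulates what a SPARQL CONSTRUCT query does (match a WHERE pattern against triples, then emit new triples from a template) on toy data. All prefixes, predicates, signature identifiers, protein identifiers, and the `apply_rule` helper are hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

# NOTE: every identifier below is a hypothetical placeholder, not the real
# HAMAP or UniProt RDF vocabulary.
Triple = Tuple[str, str, str]

# Toy protein descriptions as subject-predicate-object triples.
PROTEINS: Set[Triple] = {
    ("ex:P1", "ex:matchesSignature", "ex:MF_00001"),
    ("ex:P1", "ex:inKingdom", "ex:Bacteria"),
    ("ex:P2", "ex:matchesSignature", "ex:MF_00001"),
    ("ex:P2", "ex:inKingdom", "ex:Eukaryota"),
    ("ex:P3", "ex:matchesSignature", "ex:MF_99999"),
    ("ex:P3", "ex:inKingdom", "ex:Bacteria"),
}

# A rule has two parts, mirroring a CONSTRUCT query:
#   "where":     (predicate, object) pairs a protein must have
#   "construct": (predicate, object) pairs to emit for each matching protein
RULE: Dict[str, List[Tuple[str, str]]] = {
    "where": [
        ("ex:matchesSignature", "ex:MF_00001"),
        ("ex:inKingdom", "ex:Bacteria"),
    ],
    "construct": [
        ("ex:recommendedName", "Example protein A"),
        ("ex:keyword", "Example keyword"),
    ],
}


def apply_rule(rule: Dict[str, List[Tuple[str, str]]],
               triples: Set[Triple]) -> Set[Triple]:
    """Return annotation triples for every subject satisfying all conditions."""
    subjects = {subject for subject, _, _ in triples}
    annotations: Set[Triple] = set()
    for subject in subjects:
        # All WHERE conditions must hold for the same subject.
        if all((subject, pred, obj) in triples for pred, obj in rule["where"]):
            annotations.update(
                (subject, pred, obj) for pred, obj in rule["construct"]
            )
    return annotations


if __name__ == "__main__":
    for triple in sorted(apply_rule(RULE, PROTEINS)):
        print(triple)
```

In this toy data only `ex:P1` satisfies both conditions, so only it receives annotations. With a real SPARQL engine the matching loop disappears: the rule is the query text itself, which is why the same rule file can be run on any standards-compliant engine.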
HAMAP SPARQL rules are freely available for download from the HAMAP FTP site, ftp://ftp.expasy.org/databases/hamap/sparql/, under the CC-BY-ND 4.0 license. The annotations generated by the rules are under the CC-BY 4.0 license. A tutorial and supplementary code to use HAMAP as SPARQL are available on GitHub at https://github.com/sib-swiss/HAMAP-SPARQL, and general documentation about HAMAP can be found on the HAMAP website at https://hamap.expasy.org.