Biocommons, San Francisco, CA, United States of America.
Invitae, Inc., San Francisco, CA, United States of America.
PLoS One. 2020 Dec 3;15(12):e0239883. doi: 10.1371/journal.pone.0239883. eCollection 2020.
Access to biological sequence data, such as genome, transcript, or protein sequences, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is inconsequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility.
Here we describe SeqRepo, a novel system for building a local, high-performance, non-redundant collection of biological sequences. SeqRepo enables clients to use primary database identifiers and several digests to identify sequences and sequence aliases. SeqRepo provides a native Python interface and a REST interface, which can run locally and enables access from other programming languages. SeqRepo also provides an alternative REST interface based on the GA4GH refget protocol. SeqRepo provides fast random access to sequence slices. We provide results that demonstrate that a local SeqRepo sequence collection yields significant performance benefits of up to 1300-fold over remote sequence collections. In our use case for a variant validation and normalization pipeline, SeqRepo improved throughput 50-fold relative to use with remote sequences. SeqRepo may be used with any species or sequence type. Regular snapshots of human sequence collections are available.

It is often convenient or necessary to use a computed digest as a sequence identifier. For example, a digest-based identifier may be used to refer to proprietary reference genomes or segments of a graph genome, for which conventional identifiers will not be available. Here we also introduce a convention for the application of the SHA-512 hashing algorithm with Base64 encoding to generate URL-safe identifiers. This convention, sha512t24u, combines a fast digest mechanism with a space-efficient representation that can be used for any object. Our report includes an analysis of timing and collision probabilities for sha512t24u. SeqRepo enables clients to use sha512t24u digests as identifiers, thereby seamlessly integrating public and private sequence sets.
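The digest convention described above can be sketched in a few lines of Python. This is a minimal illustration rather than SeqRepo's own code: it assumes that the "t24" in sha512t24u denotes truncation of the SHA-512 digest to its first 24 bytes and that "u" denotes the URL-safe Base64 alphabet. The function names and the birthday-bound approximation used to estimate collision probability are illustrative choices, not taken from the abstract.

```python
import base64
import hashlib
import math


def sha512t24u(blob: bytes) -> str:
    """Illustrative sha512t24u: SHA-512, truncated to 24 bytes, URL-safe Base64.

    Assumption: "t24" means keep the first 24 bytes of the digest and "u"
    means the URL-safe Base64 alphabet. 24 bytes encode to exactly 32
    characters, so no '=' padding appears.
    """
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def collision_probability(n_objects: int, bits: int = 192) -> float:
    """Birthday-bound estimate of the chance that any two of n_objects
    share a digest of the given width (192 bits for a 24-byte truncation).

    Uses p ~= 1 - exp(-n(n-1) / 2^(bits+1)); expm1 keeps precision when
    the probability is extremely small.
    """
    exponent = n_objects * (n_objects - 1) / 2 ** (bits + 1)
    return -math.expm1(-exponent)


if __name__ == "__main__":
    # Digest of a short nucleotide sequence, encoded as ASCII bytes.
    print(sha512t24u(b"ACGT"))
    # Estimated collision probability among 10^18 distinct objects.
    print(collision_probability(10**18))
```

Because the identifier depends only on the bytes supplied, the same function could be applied to a public reference sequence or to a private one, which is the property that lets digest-based identifiers bridge the two.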
SeqRepo is released under the Apache License 2.0 and is available on GitHub and PyPI. Docker images and database snapshots are also available. See https://github.com/biocommons/biocommons.seqrepo.