SeqRepo: A system for managing local collections of biological sequences.

Affiliations

Biocommons, San Francisco, CA, United States of America.

Invitae, Inc., San Francisco, CA, United States of America.

Publication information

PLoS One. 2020 Dec 3;15(12):e0239883. doi: 10.1371/journal.pone.0239883. eCollection 2020.

DOI:10.1371/journal.pone.0239883
PMID:33270643
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7714221/
Abstract

MOTIVATION

Access to biological sequence data, such as genome, transcript, or protein sequence, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is non-consequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility.

RESULTS

Here we describe SeqRepo, a novel system for building a local, high-performance, non-redundant collection of biological sequences. SeqRepo enables clients to use primary database identifiers and several digests to identify sequences and sequence aliases. SeqRepo provides a native Python interface and a REST interface, which can run locally and enables access from other programming languages. SeqRepo also provides an alternative REST interface based on the GA4GH refget protocol. SeqRepo provides fast random access to sequence slices. We provide results that demonstrate that a local SeqRepo sequence collection yields significant performance benefits of up to 1300-fold over remote sequence collections. In our use case for a variant validation and normalization pipeline, SeqRepo improved throughput 50-fold relative to use with remote sequences. SeqRepo may be used with any species or sequence type. Regular snapshots of human sequence collections are available.

It is often convenient or necessary to use a computed digest as a sequence identifier. For example, a digest-based identifier may be used to refer to proprietary reference genomes or segments of a graph genome, for which conventional identifiers will not be available. Here we also introduce a convention for the application of the SHA-512 hashing algorithm with Base64 encoding to generate URL-safe identifiers. This convention, sha512t24u, combines a fast digest mechanism with a space-efficient representation that can be used for any object. Our report includes an analysis of timing and collision probabilities for sha512t24u. SeqRepo enables clients to use sha512t24u digests as identifiers, thereby seamlessly integrating public and private sequence sets.
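The digest convention described above can be illustrated with a minimal Python sketch. It assumes the usual reading of the name sha512t24u: compute SHA-512, truncate to the first 24 bytes (192 bits), and encode with the URL-safe Base64 alphabet. The function names `sha512t24u` and `collision_probability` are illustrative and are not taken from the SeqRepo code base, and the collision estimate is the standard birthday-bound approximation rather than the specific analysis reported in the paper.

```python
import base64
import hashlib
import math


def sha512t24u(data: bytes) -> str:
    """Return a URL-safe identifier for `data`.

    Assumed convention: SHA-512 digest, truncated to the first 24 bytes,
    then encoded with the URL-safe Base64 alphabet.
    """
    digest = hashlib.sha512(data).digest()
    # 24 bytes is a multiple of 3, so the encoding needs no '=' padding
    # and is always 32 characters long.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def collision_probability(n_objects: int, digest_bits: int = 192) -> float:
    """Birthday-bound approximation of the chance of at least one collision.

    P ~= 1 - exp(-n * (n - 1) / 2^(bits + 1)), assuming uniformly
    distributed digests.
    """
    exponent = -n_objects * (n_objects - 1) / 2 ** (digest_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    identifier = sha512t24u(b"ACGT")
    print(identifier, len(identifier))
    # Estimated collision probability among one trillion distinct objects.
    print(collision_probability(10**12))
```

Because the input is an arbitrary byte string, the same function applies to any serialized object, which matches the abstract's statement that the convention is not limited to sequences. Truncation trades a shorter identifier for a higher, but under this approximation still negligible, collision probability.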

AVAILABILITY

SeqRepo is released under the Apache License 2.0 and is available on GitHub and PyPI. Docker images and database snapshots are also available. See https://github.com/biocommons/biocommons.seqrepo.

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c2d0/7714221/e804326c3e0c/pone.0239883.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c2d0/7714221/73d10784e8e5/pone.0239883.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c2d0/7714221/a2286e665229/pone.0239883.g003.jpg
