StatsDB: platform-agnostic storage and understanding of next generation sequencing run metrics.

Authors

Ramirez-Gonzalez Ricardo H, Leggett Richard M, Waite Darren, Thanki Anil, Drou Nizar, Caccamo Mario, Davey Robert

Affiliation

The Genome Analysis Centre, Norwich Research Park, Norwich, NR4 7UH, UK.

Publication details

F1000Res. 2013 Nov 15;2:248. doi: 10.12688/f1000research.2-248.v2. eCollection 2013.

DOI:10.12688/f1000research.2-248.v2

PMID:24627795

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3938176/

Abstract

Modern sequencing platforms generate enormous quantities of data in ever-decreasing amounts of time. Additionally, techniques such as multiplex sequencing allow one run to contain hundreds of different samples. With such data comes a significant challenge to understand its quality and to understand how the quality and yield are changing across instruments and over time. As well as the desire to understand historical data, sequencing centres often have a duty to provide clear summaries of individual run performance to collaborators or customers. We present StatsDB, an open-source software package for storage and analysis of next generation sequencing run metrics. The system has been designed for incorporation into a primary analysis pipeline, either at the programmatic level or via integration into existing user interfaces. Statistics are stored in an SQL database and APIs provide the ability to store and access the data while abstracting the underlying database design. This abstraction allows simpler, wider querying across multiple fields than is possible by the manual steps and calculation required to dissect individual reports, e.g. "provide metrics about nucleotide bias in libraries using adaptor barcode X, across all runs on sequencer A, within the last month". The software is supplied with modules for storage of statistics from FastQC, a commonly used tool for analysis of sequence reads, but the open nature of the database schema means it can be easily adapted to other tools. Currently at The Genome Analysis Centre (TGAC), reports are accessed through our LIMS system or through a standalone GUI tool, but the API and supplied examples make it easy to develop custom reports and to interface with other packages.
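The kind of cross-run query quoted in the abstract can be sketched against a small SQL database. The example below is only an illustration under stated assumptions: the two-table layout, the metric name `base_content_a`, and the helpers `build_demo` and `mean_metric_by_position` are hypothetical and are not the StatsDB schema or API. It uses Python's standard `sqlite3` module with an in-memory database and made-up values.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical two-table layout for illustration only; NOT the StatsDB schema.
SCHEMA = """
CREATE TABLE run (
    id INTEGER PRIMARY KEY,
    instrument TEXT NOT NULL,
    barcode TEXT NOT NULL,
    run_date TEXT NOT NULL      -- ISO 8601 date, so string comparison orders correctly
);
CREATE TABLE metric (
    run_id INTEGER NOT NULL REFERENCES run(id),
    name TEXT NOT NULL,         -- hypothetical metric name, e.g. 'base_content_a'
    position INTEGER NOT NULL,  -- position within the read
    value REAL NOT NULL
);
"""


def mean_metric_by_position(conn, name, instrument, barcode, since):
    """Average a per-position metric over matching runs dated on or after `since`."""
    rows = conn.execute(
        """
        SELECT m.position, AVG(m.value)
        FROM metric AS m
        JOIN run AS r ON r.id = m.run_id
        WHERE m.name = ? AND r.instrument = ? AND r.barcode = ? AND r.run_date >= ?
        GROUP BY m.position
        ORDER BY m.position
        """,
        (name, instrument, barcode, since.isoformat()),
    ).fetchall()
    return dict(rows)


def build_demo(today):
    """Create an in-memory database holding a few made-up runs and metrics."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    runs = [
        (1, "A", "X", (today - timedelta(days=3)).isoformat()),
        (2, "A", "X", (today - timedelta(days=10)).isoformat()),
        (3, "A", "X", (today - timedelta(days=90)).isoformat()),  # outside the window
        (4, "B", "X", (today - timedelta(days=2)).isoformat()),   # different sequencer
    ]
    conn.executemany("INSERT INTO run VALUES (?, ?, ?, ?)", runs)
    metrics = [
        (1, "base_content_a", 1, 30.0), (1, "base_content_a", 2, 26.0),
        (2, "base_content_a", 1, 20.0), (2, "base_content_a", 2, 24.0),
        (3, "base_content_a", 1, 90.0), (4, "base_content_a", 1, 80.0),
    ]
    conn.executemany("INSERT INTO metric VALUES (?, ?, ?, ?)", metrics)
    return conn


if __name__ == "__main__":
    today = date(2013, 11, 15)
    conn = build_demo(today)
    # Barcode X, sequencer A, last 30 days: only runs 1 and 2 qualify.
    print(mean_metric_by_position(conn, "base_content_a", "A", "X",
                                  today - timedelta(days=30)))
```

Keeping run attributes and metric values in separate tables is what lets a single query filter by instrument, barcode, and date at once, rather than opening individual reports one at a time.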

Similar articles

StatsDB: platform-agnostic storage and understanding of next generation sequencing run metrics.

F1000Res. 2013 Nov 15;2:248. doi: 10.12688/f1000research.2-248.v2. eCollection 2013.

Analysing 454 amplicon resequencing experiments using the modular and database oriented Variant Identification Pipeline.

BMC Bioinformatics. 2010 May 20;11:269. doi: 10.1186/1471-2105-11-269.

NG6: Integrated next generation sequencing storage and processing environment.

BMC Genomics. 2012 Sep 9;13:462. doi: 10.1186/1471-2164-13-462.

Sequencing quality assessment tools to enable data-driven informatics for high throughput genomics.

Front Genet. 2013 Dec 17;4:288. doi: 10.3389/fgene.2013.00288.

SLIM: a flexible web application for the reproducible processing of environmental DNA metabarcoding data.

BMC Bioinformatics. 2019 Feb 19;20(1):88. doi: 10.1186/s12859-019-2663-2.

Galaxy LIMS for next-generation sequencing.

Bioinformatics. 2013 May 1;29(9):1233-4. doi: 10.1093/bioinformatics/btt115. Epub 2013 Mar 11.

One tool to find them all: a case of data integration and querying in a distributed LIMS platform.

Database (Oxford). 2019 Jan 1;2019:baz004. doi: 10.1093/database/baz004.

ngsReports: a Bioconductor package for managing FastQC reports and other NGS related log files.

Bioinformatics. 2020 Apr 15;36(8):2587-2588. doi: 10.1093/bioinformatics/btz937.

Assembling proteomics data as a prerequisite for the analysis of large scale experiments.

Chem Cent J. 2009 Jan 23;3:2. doi: 10.1186/1752-153X-3-2.

Human variation database: an open-source database template for genomic discovery.

Bioinformatics. 2011 Apr 15;27(8):1155-6. doi: 10.1093/bioinformatics/btr100. Epub 2011 Mar 2.

Cited by

LINC01235 Promotes Clonal Evolution through DNA Replication Licensing-Induced Chromosomal Instability in Breast Cancer.

Adv Sci (Weinh). 2025 Apr;12(14):e2413527. doi: 10.1002/advs.202413527. Epub 2025 Feb 14.

A20 Restricts NOS2 Expression and Intestinal Tumorigenesis in a Mouse Model of Colitis-Associated Cancer.

Gastro Hep Adv. 2023;2(1):96-107. doi: 10.1016/j.gastha.2022.09.004. Epub 2022 Sep 19.

Employing toxin-antitoxin genome markers for identification of and strains in human metagenomes.

PeerJ. 2019 Mar 4;7:e6554. doi: 10.7717/peerj.6554. eCollection 2019.

Identification of potential genes for human ischemic cardiomyopathy based on RNA-Seq data.

Oncotarget. 2016 Dec 13;7(50):82063-82073. doi: 10.18632/oncotarget.13331.

A Multistate Toggle Switch Defines Fungal Cell Fates and Is Regulated by Synergistic Genetic Cues.

PLoS Genet. 2016 Oct 6;12(10):e1006353. doi: 10.1371/journal.pgen.1006353. eCollection 2016 Oct.

AlmostSignificant: simplifying quality control of high-throughput sequencing data.

Bioinformatics. 2016 Dec 15;32(24):3850-3851. doi: 10.1093/bioinformatics/btw559. Epub 2016 Aug 24.

Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data.

Bioinformatics. 2016 Jan 15;32(2):292-4. doi: 10.1093/bioinformatics/btv566. Epub 2015 Oct 1.

GacA is essential for Group A Streptococcus and defines a new class of monomeric dTDP-4-dehydrorhamnose reductases (RmlD).

Mol Microbiol. 2015 Dec;98(5):946-62. doi: 10.1111/mmi.13169. Epub 2015 Oct 1.

Next-Generation Sequencing Techniques Reveal that Genomic Imprinting Is Absent in Day-Old Gallus gallus domesticus Brains.

PLoS One. 2015 Jul 10;10(7):e0132345. doi: 10.1371/journal.pone.0132345. eCollection 2015.

Draft Genome Sequences of Devosia sp. Strain 17-2-E-8 and Devosia riboflavina Strain IFO13584.

Genome Announc. 2014 Oct 2;2(5):e00994-14. doi: 10.1128/genomeA.00994-14.
