Ramirez-Gonzalez Ricardo H, Leggett Richard M, Waite Darren, Thanki Anil, Drou Nizar, Caccamo Mario, Davey Robert
The Genome Analysis Centre, Norwich Research Park, Norwich, NR4 7UH, UK.
F1000Res. 2013 Nov 15;2:248. doi: 10.12688/f1000research.2-248.v2. eCollection 2013.
Modern sequencing platforms generate enormous quantities of data in ever-decreasing amounts of time. Additionally, techniques such as multiplex sequencing allow one run to contain hundreds of different samples. With such data comes a significant challenge: understanding its quality, and how quality and yield change across instruments and over time. Beyond the desire to understand historical data, sequencing centres often have a duty to provide clear summaries of individual run performance to collaborators or customers. We present StatsDB, an open-source software package for storage and analysis of next-generation sequencing run metrics. The system has been designed for incorporation into a primary analysis pipeline, either at the programmatic level or via integration into existing user interfaces. Statistics are stored in an SQL database, and APIs provide the ability to store and access the data while abstracting the underlying database design. This abstraction allows simpler, wider querying across multiple fields than is possible with the manual steps and calculations required to dissect individual reports, e.g. "provide metrics about nucleotide bias in libraries using adaptor barcode X, across all runs on sequencer A, within the last month". The software is supplied with modules for storage of statistics from FastQC, a commonly used tool for analysis of sequence reads, but the open nature of the database schema means it can be easily adapted to other tools. Currently at The Genome Analysis Centre (TGAC), reports are accessed through our LIMS or through a standalone GUI tool, but the API and supplied examples make it easy to develop custom reports and to interface with other packages.
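To make the kind of cross-run query quoted above concrete, the following is a minimal sketch in Python using the standard-library sqlite3 module. It is not the StatsDB schema or API: the tables `run` and `metric`, the functions `open_store`, `add_run` and `average_metric`, and the metric name `gc_percent` (a stand-in for a nucleotide-bias statistic) are all hypothetical names chosen for illustration.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical, simplified schema for illustration only.
# This is NOT the actual StatsDB database design.
SCHEMA = """
CREATE TABLE run (
    id         INTEGER PRIMARY KEY,
    instrument TEXT NOT NULL,
    barcode    TEXT NOT NULL,
    run_date   TEXT NOT NULL  -- ISO 8601 date, so string comparison orders correctly
);
CREATE TABLE metric (
    run_id INTEGER NOT NULL REFERENCES run(id),
    name   TEXT NOT NULL,
    value  REAL NOT NULL
);
"""


def open_store():
    """Create an in-memory database with the illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_run(conn, instrument, barcode, run_date, metrics):
    """Store one run and its metrics (a dict of name -> value); return the run id."""
    cur = conn.execute(
        "INSERT INTO run (instrument, barcode, run_date) VALUES (?, ?, ?)",
        (instrument, barcode, run_date.isoformat()),
    )
    run_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO metric (run_id, name, value) VALUES (?, ?, ?)",
        [(run_id, name, value) for name, value in metrics.items()],
    )
    return run_id


def average_metric(conn, name, instrument, barcode, since):
    """Return (average value, number of values) for one metric, restricted to
    a given instrument and barcode, for runs dated on or after `since`."""
    return conn.execute(
        """
        SELECT AVG(m.value), COUNT(*)
        FROM metric AS m
        JOIN run AS r ON r.id = m.run_id
        WHERE m.name = ? AND r.instrument = ? AND r.barcode = ? AND r.run_date >= ?
        """,
        (name, instrument, barcode, since.isoformat()),
    ).fetchone()


if __name__ == "__main__":
    conn = open_store()
    today = date(2013, 11, 15)  # fixed date so the example is reproducible
    add_run(conn, "A", "X", today - timedelta(days=5), {"gc_percent": 48.0})
    add_run(conn, "A", "X", today - timedelta(days=10), {"gc_percent": 52.0})
    add_run(conn, "A", "X", today - timedelta(days=90), {"gc_percent": 30.0})
    add_run(conn, "B", "X", today - timedelta(days=3), {"gc_percent": 70.0})
    # Barcode X on sequencer A within the last 30 days
    print(average_metric(conn, "gc_percent", "A", "X", today - timedelta(days=30)))
```

Callers of `average_metric` never write SQL or see table names, which is the sense in which an API layer abstracts the underlying database design. The same question asked of individual per-run reports would require opening each report and aggregating by hand.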