The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Author information

Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA.

Publication information

Genome Res. 2010 Sep;20(9):1297-303. doi: 10.1101/gr.107524.110. Epub 2010 Jul 19.

Abstract

Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS (the 1000 Genomes pilot alone includes nearly five terabases) make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, the complexity of accessing and manipulating the data produced by these machines limits both the scope of the scientific questions many professionals can answer and the ease with which they can answer them. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
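
The separation the abstract describes, between analysis calculations and data management, can be illustrated with a toy coverage calculator. The sketch below is only an illustration in Python and is not the GATK API (the GATK itself is a Java framework). The names `Read`, `traverse_by_locus`, `coverage_map`, `coverage_reduce`, and `mean_coverage` are hypothetical and chosen for this example. Reads are assumed to be simple gapless intervals on a contig.

```python
from collections import defaultdict
from dataclasses import dataclass
from functools import reduce
from typing import Dict, Iterable, Iterator, List, Tuple

Locus = Tuple[str, int]


@dataclass(frozen=True)
class Read:
    """Toy aligned read (hypothetical type): contig, 0-based start, length."""
    contig: str
    start: int
    length: int


def traverse_by_locus(reads: Iterable[Read]) -> Iterator[Tuple[Locus, List[Read]]]:
    """Stand-in for the data management side: yield each covered
    reference position together with the reads overlapping it."""
    pileups: Dict[Locus, List[Read]] = defaultdict(list)
    for read in reads:
        for pos in range(read.start, read.start + read.length):
            pileups[(read.contig, pos)].append(read)
    for locus in sorted(pileups):
        yield locus, pileups[locus]


def coverage_map(locus: Locus, pileup: List[Read]) -> int:
    """Map step (analysis side): depth of coverage at a single locus."""
    return len(pileup)


def coverage_reduce(acc: Tuple[int, int], depth: int) -> Tuple[int, int]:
    """Reduce step: accumulate (number of covered loci, summed depth)."""
    loci, total = acc
    return loci + 1, total + depth


def mean_coverage(reads: Iterable[Read]) -> float:
    """Run map over every locus, then fold the results with reduce."""
    depths = (coverage_map(locus, pileup)
              for locus, pileup in traverse_by_locus(reads))
    loci, total = reduce(coverage_reduce, depths, (0, 0))
    return total / loci if loci else 0.0


if __name__ == "__main__":
    reads = [Read("chr1", 0, 4), Read("chr1", 2, 4)]
    print(mean_coverage(reads))
```

In this sketch the analysis author writes only `coverage_map` and `coverage_reduce`. Everything about how reads are grouped and ordered lives in `traverse_by_locus`, so that part could in principle be optimized or parallelized without touching the analysis code.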
