Michael Lawrence, Martin Morgan
Genentech, 1 DNA Way, South San Francisco, California 94080, USA
Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., P.O. Box 19024, Seattle, Washington 98109, USA
Stat Sci. 2014 May;29(2):214-226. doi: 10.1214/14-STS476. Epub 2014 Aug 18.
This paper reviews strategies for solving problems encountered when analyzing large genomic data sets and describes how those strategies are implemented in R by packages from the Bioconductor project. We cover the scalable processing, summarization and visualization of big genomic data. The general ideas are well established: restrictive queries, compression, iteration and parallel computing. We demonstrate the strategies by applying Bioconductor packages to the detection and analysis of genetic variants from a whole genome sequencing experiment.
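As a rough illustration of three of the strategies named in the abstract (compression, iteration and restrictive queries), the following minimal sketch processes a compressed stream in fixed-size chunks and keeps only the records that fall inside a requested region. It is written in plain Python rather than R and does not use any Bioconductor API. The record layout, the `RECORDS` toy data and all function names are assumptions made for illustration only, and parallel computing is left out for brevity.

```python
import gzip
import io
from itertools import islice

# Hypothetical toy data: (chromosome, position) records, not a real genomic format.
RECORDS = [("chr1", p) for p in range(1, 1001)] + [("chr2", p) for p in range(1, 501)]


def write_compressed(records):
    """Compression: store records as gzip-compressed, tab-separated text."""
    text = "".join(f"{chrom}\t{pos}\n" for chrom, pos in records)
    return gzip.compress(text.encode("ascii"))


def iter_chunks(blob, chunk_size):
    """Iteration: yield lists of at most chunk_size lines at a time."""
    with gzip.open(io.BytesIO(blob), "rt", encoding="ascii") as handle:
        while True:
            chunk = list(islice(handle, chunk_size))
            if not chunk:
                return
            yield chunk


def count_in_region(chunk, chrom, start, end):
    """Restrictive query: count only records inside chrom:start-end (inclusive)."""
    hits = 0
    for line in chunk:
        c, pos = line.rstrip("\n").split("\t")
        if c == chrom and start <= int(pos) <= end:
            hits += 1
    return hits


def count_region(blob, chrom, start, end, chunk_size=100):
    """Combine per-chunk counts into one summary without holding all records."""
    return sum(count_in_region(chunk, chrom, start, end)
               for chunk in iter_chunks(blob, chunk_size))


blob = write_compressed(RECORDS)
total = count_region(blob, "chr1", 101, 350)
print(total)  # → 250
```

Because each chunk is reduced to a single count before the next one is read, peak memory depends on `chunk_size` rather than on the size of the data set, which is the point of the iteration strategy.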