VA Palo Alto Health Care System, Palo Alto Epidemiology Research and Information Center for Genomics, CA 94304, USA.
Department of Genetics.
Bioinformatics. 2017 Dec 1;33(23):3709-3715. doi: 10.1093/bioinformatics/btx468.
Large-scale genomic sequencing is now widely used to answer questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. Given the quantity and diversity of these data, a robust and scalable solution for data handling and analysis is needed.
We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate that such Big Data computing paradigms can provide orders-of-magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information.
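As a rough illustration of two of the quantities named above (call rate and allele frequency), the following minimal Python sketch computes them on a tiny in-memory genotype table. It is not the authors' BigQuery implementation. The data layout, the site and sample names, and the function names are all hypothetical, and the definitions used here (per-sample call rate as the fraction of sites with a called genotype, alternate-allele frequency over called alleles only) are simplifying assumptions.

```python
from collections import Counter

# Hypothetical toy data: GENOTYPES[site][sample] = (allele, allele),
# where 0 = reference, 1 = alternate, None = no-call.
GENOTYPES = {
    "chr1:1000": {"S1": (0, 1), "S2": (1, 1), "S3": (None, None), "S4": (0, 0)},
    "chr1:2000": {"S1": (0, 0), "S2": (0, 1), "S3": (0, 0), "S4": (0, 0)},
}


def sample_call_rates(genotypes):
    """Per-sample call rate: fraction of sites with a fully called genotype."""
    called = Counter()
    for calls in genotypes.values():
        for sample, gt in calls.items():
            # Touch every sample so no-call-only samples still get an entry.
            called[sample] += 0 if None in gt else 1
    n_sites = len(genotypes)
    return {sample: count / n_sites for sample, count in called.items()}


def alt_allele_frequency(calls):
    """Alternate-allele frequency at one site, ignoring no-call alleles."""
    alleles = [a for gt in calls.values() for a in gt if a is not None]
    return alleles.count(1) / len(alleles) if alleles else 0.0


call_rates = sample_call_rates(GENOTYPES)
alt_freqs = {site: alt_allele_frequency(calls) for site, calls in GENOTYPES.items()}
```

In a columnar database these would typically be expressed as grouped aggregations over a variants table rather than Python loops; the sketch only pins down what is being counted.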
Our analysis framework is implemented in Google Cloud Platform and BigQuery. Code is available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs.
cuiping@stanford.edu or ptsao@stanford.edu.
Supplementary data are available at Bioinformatics online.