Cloud-based interactive analytics for terabytes of genomic variants data.

Author information

VA Palo Alto Health Care System, Palo Alto Epidemiology Research and Information Center for Genomics, CA 94304, USA.

Department of Genetics.

Publication information

Bioinformatics. 2017 Dec 1;33(23):3709-3715. doi: 10.1093/bioinformatics/btx468.

DOI:10.1093/bioinformatics/btx468

PMID:28961771

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC5860318/

Abstract

MOTIVATION

Large scale genomic sequencing is now widely used to decipher questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. With the quantity and diversity these data harbor, a robust and scalable data handling and analysis solution is desired.

RESULTS

We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate such Big Data computing paradigms can provide orders of magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information.
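The per-variant quality metrics named above (call rate and allele frequency) reduce to group-by aggregations over genotype calls, which is the kind of scan a columnar engine handles well. The following is a minimal in-memory sketch of that aggregation, not the authors' BigQuery code: the `CALLS` records, the variant identifiers, and the encoding of a no-call as `-1` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical genotype calls: (variant_id, sample_id, genotype).
# Assumed allele encoding: 0 = reference, 1 = alternate, -1 = no-call.
CALLS = [
    ("chr1:1000:A>G", "s1", (0, 1)),
    ("chr1:1000:A>G", "s2", (1, 1)),
    ("chr1:1000:A>G", "s3", (-1, -1)),
    ("chr1:2000:C>T", "s1", (0, 0)),
    ("chr1:2000:C>T", "s2", (0, 1)),
    ("chr1:2000:C>T", "s3", (0, 0)),
]


def summarize(calls):
    """Group calls by variant; return {variant_id: (call_rate, alt_allele_freq)}."""
    totals = defaultdict(lambda: {"samples": 0, "called": 0, "alt": 0, "alleles": 0})
    for variant_id, _sample_id, genotype in calls:
        t = totals[variant_id]
        t["samples"] += 1
        if any(allele < 0 for allele in genotype):
            continue  # any missing allele counts the sample as a no-call
        t["called"] += 1
        t["alleles"] += len(genotype)
        t["alt"] += sum(1 for allele in genotype if allele > 0)

    summary = {}
    for variant_id, t in totals.items():
        call_rate = t["called"] / t["samples"]
        # Allele frequency is undefined when no sample was called.
        alt_freq = t["alt"] / t["alleles"] if t["alleles"] else None
        summary[variant_id] = (call_rate, alt_freq)
    return summary


if __name__ == "__main__":
    for variant_id, (call_rate, alt_freq) in summarize(CALLS).items():
        print(variant_id, round(call_rate, 3), alt_freq)
```

In a columnar database the same logic would be a single aggregate query grouped by variant; the loop here only makes the counting explicit.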

AVAILABILITY AND IMPLEMENTATION

Our analysis framework is implemented in Google Cloud Platform and BigQuery. Code is available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs.

CONTACT

cuiping@stanford.edu or ptsao@stanford.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

