An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data.

Author information

Jun Goo, Wing Mary Kate, Abecasis Gonçalo R, Kang Hyun Min

Affiliations

Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA; Center for Statistical Genetics and Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan 48109, USA.

Center for Statistical Genetics and Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan 48109, USA.

Publication information

Genome Res. 2015 Jun;25(6):918-25. doi: 10.1101/gr.176552.114. Epub 2015 Apr 16.

DOI:10.1101/gr.176552.114

PMID:25883319

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4448687/

Abstract

The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires less computational resources than current alternatives. Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7fa6/4448687/760ce575503b/918f01.jpg
