Improving variant calling using population data and deep learning.

Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA.

Google Health, Palo Alto, CA, 94304, USA.

Publication information

BMC Bioinformatics. 2023 May 12;24(1):197. doi: 10.1186/s12859-023-05294-0.

DOI:10.1186/s12859-023-05294-0

PMID:37173615

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10182612/

Abstract

Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. In this study, we develop population-aware DeepVariant models with a new channel encoding allele frequencies from the 1000 Genomes Project. This model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic ClinVar calls cohort-wide. We assess the use of population-specific or diverse reference panels, finding the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.
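
The idea of "a new channel encoding allele frequencies" can be pictured with a small sketch. The code below is only an illustration under stated assumptions, not the DeepVariant implementation: the nested-list pileup layout, the `MAX_PIXEL` scale, the linear frequency-to-intensity mapping, and the function names `frequency_to_intensity` and `add_frequency_channel` are all hypothetical. It appends one extra value per pixel that reflects how common the allele carried by that read is in a reference panel.

```python
from typing import Dict, List

# Hypothetical intensity ceiling for this sketch; not taken from the paper.
MAX_PIXEL = 254


def frequency_to_intensity(allele_frequency: float) -> int:
    """Map an allele frequency in [0, 1] to an integer pixel intensity.

    The linear scaling is an assumption made for illustration only.
    """
    clamped = min(max(allele_frequency, 0.0), 1.0)
    return round(clamped * MAX_PIXEL)


def add_frequency_channel(
    pileup: List[List[List[int]]],
    read_alleles: List[List[str]],
    panel_frequencies: List[Dict[str, float]],
) -> List[List[List[int]]]:
    """Return a copy of the pileup with one extra channel per pixel.

    pileup[read][position] is a list of existing channel values.
    read_alleles[read][position] is the allele that read carries there.
    panel_frequencies[position] maps an allele to its panel frequency;
    alleles absent from the panel are treated as frequency 0.
    """
    augmented = []
    for read_pixels, alleles in zip(pileup, read_alleles):
        row = []
        for position, (pixel, allele) in enumerate(zip(read_pixels, alleles)):
            frequency = panel_frequencies[position].get(allele, 0.0)
            # Build a new list so the input pileup is left unchanged.
            row.append(pixel + [frequency_to_intensity(frequency)])
        augmented.append(row)
    return augmented


# Toy input: two reads, two positions, two pre-existing channels per pixel.
pileup = [[[10, 20], [30, 40]], [[50, 60], [70, 80]]]
read_alleles = [["A", "T"], ["A", "C"]]
panel_frequencies = [{}, {"T": 0.5, "C": 1.0}]

augmented = add_frequency_channel(pileup, read_alleles, panel_frequencies)
```

In this toy setup a classifier consuming `augmented` would see, alongside the read evidence, whether a candidate allele is common or unseen in the panel, which is the kind of signal the abstract describes feeding into the caller rather than applying as a post hoc filter.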


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5f6c/10182612/ccf4fa7e7d56/12859_2023_5294_Fig1_HTML.jpg
