Generative haplotype prediction outperforms statistical methods for small variant detection in next-generation sequencing data.

Affiliations

Institute for Research and Innovation, ARUP Labs, Salt Lake City, UT 84108, United States.

Institute for Clinical and Experimental Pathology, ARUP Labs, Salt Lake City, UT 84108, United States.

Publication information

Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae565.

DOI:10.1093/bioinformatics/btae565
PMID:39298478
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11549014/
Abstract

MOTIVATION

Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or Hidden Markov models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden.

RESULTS

We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in a generative fashion identical to modern large language models. We train our model on 37 whole genome sequences from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that our method has superior overall accuracy compared to other methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.
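The phrase "F1-maximizing quality thresholds" can be made concrete with a small sketch. The code below is not the authors' evaluation pipeline; it only illustrates the general idea of sweeping a per-call quality cutoff and keeping the cutoff with the highest F1 score. The function names (`f1_at_threshold`, `best_threshold`) and the toy data in `CALLS` are hypothetical and were invented for this illustration.

```python
from typing import List, Tuple

# Each call is (quality score, whether it matches a truth-set variant).
# Toy data for illustration only, not results from the paper.
CALLS: List[Tuple[float, bool]] = [
    (50, True), (40, True), (30, False), (20, True), (10, False),
]
TOTAL_TRUE = 4  # truth-set variants, including one that was never called


def f1_at_threshold(
    calls: List[Tuple[float, bool]], total_true: int, threshold: float
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for calls with quality >= threshold."""
    kept = [is_tp for qual, is_tp in calls if qual >= threshold]
    tp = sum(kept)            # true positives that survive the cutoff
    fp = len(kept) - tp       # false positives that survive the cutoff
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / total_true if total_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(calls: List[Tuple[float, bool]], total_true: int) -> float:
    """Return the observed quality value that maximizes F1 when used as a cutoff."""
    candidates = sorted({qual for qual, _ in calls})
    return max(candidates, key=lambda t: f1_at_threshold(calls, total_true, t)[2])


if __name__ == "__main__":
    t = best_threshold(CALLS, TOTAL_TRUE)
    print(t, f1_at_threshold(CALLS, TOTAL_TRUE, t))
```

Comparing callers at each one's own F1-maximizing cutoff, as the abstract describes, avoids penalizing a tool simply because its quality scores are calibrated on a different scale.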

AVAILABILITY AND IMPLEMENTATION

Jenever is implemented as a Python-based command-line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8f8/11549014/4042a267276f/btae565f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8f8/11549014/5547f523cfc7/btae565f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8f8/11549014/8de29a6afb35/btae565f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8f8/11549014/88cfeea1148c/btae565f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8f8/11549014/c7110e12e581/btae565f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8f8/11549014/254e618b7609/btae565f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8f8/11549014/3b80516bbc86/btae565f7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d8f8/11549014/dfaecfdab105/btae565f8.jpg
