Generative haplotype prediction outperforms statistical methods for small variant detection in next-generation sequencing data.

Affiliations

Institute for Research and Innovation, ARUP Labs, Salt Lake City, UT 84108, United States.

Institute for Clinical and Experimental Pathology, ARUP Labs, Salt Lake City, UT 84108, United States.

Publication information

Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae565.

Abstract

MOTIVATION

Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or Hidden Markov models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden.

RESULTS

We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in a generative fashion identical to modern large language models. We train our model on 37 whole genome sequences from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that our method has superior overall accuracy compared to other methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.
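To make "F1-maximizing quality thresholds" concrete, the following is a minimal sketch of how such a threshold might be chosen for a single caller. It is not taken from Jenever or from the paper's benchmarking pipeline: the `Call` class, the function names, and the toy data are all hypothetical, and the true/false-positive labeling is a simplification of what a real comparison against a truth set would do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical variant call: a quality score plus a truth-set label."""
    qual: float
    is_true: bool  # True if the call matches the truth set


def f1_at_threshold(calls, n_truth, threshold):
    """Return (precision, sensitivity, F1) keeping only calls with qual >= threshold."""
    kept = [c for c in calls if c.qual >= threshold]
    tp = sum(c.is_true for c in kept)   # true positives that survive filtering
    fp = len(kept) - tp                 # false positives that survive filtering
    fn = n_truth - tp                   # truth variants missed or filtered out
    precision = tp / (tp + fp) if kept else 0.0
    sensitivity = tp / (tp + fn) if n_truth else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f1


def best_threshold(calls, n_truth):
    """Scan every observed quality value and return the (threshold, F1) with the highest F1."""
    candidates = sorted({c.qual for c in calls})
    scored = ((t, f1_at_threshold(calls, n_truth, t)[2]) for t in candidates)
    return max(scored, key=lambda pair: pair[1])


# Toy data (invented for illustration): five calls against a truth set of four variants.
calls = [
    Call(50, True),
    Call(40, True),
    Call(30, False),
    Call(20, True),
    Call(10, False),
]
threshold, f1 = best_threshold(calls, n_truth=4)
print(threshold, round(f1, 3))
```

Comparing callers at their own F1-maximizing thresholds, rather than at default settings, means each tool is evaluated at its best precision/sensitivity trade-off.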

AVAILABILITY AND IMPLEMENTATION

Jenever is implemented as a Python-based command-line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.
