Institute for Research and Innovation, ARUP Labs, Salt Lake City, UT 84108, United States.
Institute for Clinical and Experimental Pathology, ARUP Labs, Salt Lake City, UT 84108, United States.
Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae565.
Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or hidden Markov models, often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden.
We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in the same generative fashion as modern large language models. We train our model on 37 whole genome sequences from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that it has higher overall accuracy than these methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, the highest precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.
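To make the phrase "F1-maximizing quality threshold" concrete, the following is a minimal sketch of how such a threshold could be chosen for one caller. It assumes every call has already been labeled as a true or false positive against a truth set, and that the total number of truth variants is known. The function names and the data layout are hypothetical illustrations and are not taken from the Jenever code base or from the evaluation pipeline used in the paper.

```python
from typing import List, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(
    calls: List[Tuple[float, bool]], n_truth: int
) -> Tuple[float, float]:
    """Return (quality threshold, F1) for the threshold that maximizes F1.

    calls   -- hypothetical input: (quality, is_true_positive) per variant call
    n_truth -- total number of variants in the truth set
    """
    best_q, best_f1 = 0.0, 0.0
    # Try each observed quality value as a candidate cutoff.
    for q in sorted({qual for qual, _ in calls}):
        kept = [is_tp for qual, is_tp in calls if qual >= q]
        tp = sum(kept)          # true positives that pass the cutoff
        fp = len(kept) - tp     # false positives that pass the cutoff
        fn = n_truth - tp       # truth variants missed or filtered out
        score = f1_score(tp, fp, fn)
        if score > best_f1:
            best_q, best_f1 = q, score
    return best_q, best_f1
```

Applying this per caller before comparison puts each tool at its own best operating point, so that differences in how the tools scale their quality scores do not dominate the comparison.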
Jenever is implemented as a Python-based command-line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.