Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA.
Google Health, Palo Alto, CA, 94304, USA.
BMC Bioinformatics. 2023 May 12;24(1):197. doi: 10.1186/s12859-023-05294-0.
Large-scale population variant data is often used to filter variant calls in a single sample and to aid their interpretation. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. In this study, we develop population-aware DeepVariant models with a new channel encoding allele frequencies from the 1000 Genomes Project. This model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic ClinVar calls cohort-wide. We assess the use of population-specific or diverse reference panels, finding the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.
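As a rough illustration of what an allele-frequency channel could look like, the following minimal sketch maps population allele frequencies to pixel intensities and fills one extra channel of a read pileup. It is not the encoding used in the study: the log-scale mapping, the constants `MAX_PIXEL` and `MIN_AF`, and the function names `af_to_pixel` and `build_af_channel` are all assumptions made for illustration.

```python
import math

MAX_PIXEL = 254   # assumed maximum channel intensity (hypothetical)
MIN_AF = 1e-5     # assumed frequency floor; rarer alleles map to 0 (hypothetical)


def af_to_pixel(af: float) -> int:
    """Map an allele frequency in [0, 1] to an integer intensity on a log scale."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"allele frequency out of range: {af}")
    if af < MIN_AF:
        return 0
    # log10(af) runs from log10(MIN_AF) (scaled 0.0) up to 0 (scaled 1.0).
    scaled = 1.0 - math.log10(af) / math.log10(MIN_AF)
    return round(scaled * MAX_PIXEL)


def build_af_channel(read_alleles, af_by_allele):
    """Build one pileup channel holding population allele frequencies.

    read_alleles: one list per read; each entry is an allele key such as
        (position, alt_base) where the read carries a non-reference allele,
        or None where the read matches the reference.
    af_by_allele: dict from allele key to population allele frequency.
        Alleles absent from the panel are treated as frequency 0.
    """
    channel = []
    for read in read_alleles:
        row = [
            0 if key is None else af_to_pixel(af_by_allele.get(key, 0.0))
            for key in read
        ]
        channel.append(row)
    return channel


# Toy example: two reads over three columns, one common and one unseen allele.
panel = {(101, "T"): 0.25}
reads = [
    [None, (101, "T"), None],
    [None, None, (102, "G")],
]
af_channel = build_af_channel(reads, panel)
```

In this sketch a common allele produces a bright pixel, while an allele absent from the panel stays dark. A model receiving such a channel alongside the usual read features could learn how much weight to give population frequency instead of applying it as a hard filter after calling.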