Benchmarking variant callers in next-generation and third-generation sequencing analysis.

Author information

Zhongshan Ophthalmic Center at Sun Yat-sen University and Annoroad Gene Technology (Beijing) Co., Ltd.

Annoroad Gene Technology (Beijing) Co., Ltd.

Publication information

Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa148.

DOI:10.1093/bib/bbaa148

PMID:32698196

Abstract

DNA variants are an important source of genetic variation among individuals. Next-generation sequencing (NGS) is the most popular technology for genome-wide variant calling, and third-generation sequencing (TGS) has recently been adopted in genetic studies as well. Although many variant callers are available, no single caller can call both types of variants on NGS or TGS data with high sensitivity and specificity. In this study, we systematically evaluated 11 variant callers on 12 NGS and TGS datasets. For germline variant calling, we tested the DNAseq and DNAscope modes from Sentieon, the HaplotypeCaller mode from GATK and the WGS mode from DeepVariant. All four callers performed comparably on NGS data, and 30× coverage was recommended for WGS data. For germline variant calling on TGS data, we tested the DNAseq mode from Sentieon, the HaplotypeCaller mode from GATK and the PACBIO mode from DeepVariant. All three callers performed similarly in SNP calling, while DeepVariant outperformed the others in InDel calling. TGS detected more variants than NGS, particularly in complex and repetitive regions. For somatic variant calling on NGS data, we tested the TNscope and TNseq modes from Sentieon, the Mutect2 mode from GATK, NeuSomatic, VarScan2 and Strelka2. TNscope and Mutect2 outperformed the other callers. Raising tumor sample purity from 10% to 20% significantly increased recall. Finally, we compared the computational costs of the callers, and Sentieon required the least. These results suggest that accurate SNP or InDel calling requires careful selection of a tool and its parameters for each scenario.
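The sensitivity (recall) and precision figures that a benchmark like this reports can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline: the `Variant` and `Metrics` types and the `score_calls` and `is_snp` helpers are hypothetical names introduced here, and variants are matched by exact (chromosome, position, ref, alt) identity, which is an assumption that ignores the representation differences real comparison tools have to resolve.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """One called variant (hypothetical minimal representation)."""
    chrom: str
    pos: int
    ref: str
    alt: str


class Metrics(NamedTuple):
    precision: float
    recall: float
    f1: float


def is_snp(v: Variant) -> bool:
    """A single-base substitution; anything else is treated as an InDel here."""
    return len(v.ref) == 1 and len(v.alt) == 1


def score_calls(truth: Set[Variant], calls: Set[Variant]) -> Metrics:
    """Compare a call set with a truth set using exact matching (assumption)."""
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(precision, recall, f1)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    truth = {
        Variant("chr1", 100, "A", "G"),
        Variant("chr1", 200, "C", "T"),
        Variant("chr2", 300, "AT", "A"),
        Variant("chr2", 400, "G", "GCC"),
    }
    calls = {
        Variant("chr1", 100, "A", "G"),
        Variant("chr1", 200, "C", "T"),
        Variant("chr2", 300, "AT", "A"),
        Variant("chr3", 500, "T", "C"),
    }
    print("all   ", score_calls(truth, calls))
    # SNPs and InDels are scored separately, as the abstract reports them.
    print("SNP   ", score_calls({v for v in truth if is_snp(v)},
                                {v for v in calls if is_snp(v)}))
    print("InDel ", score_calls({v for v in truth if not is_snp(v)},
                                {v for v in calls if not is_snp(v)}))
```

Scoring SNPs and InDels separately mirrors how the abstract reports results, since a caller can do well on one class and worse on the other.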
