Color Genomics, 831 Mitten Road, Burlingame, CA, 94010, USA.
BMC Genomics. 2018 Apr 17;19(1):263. doi: 10.1186/s12864-018-4659-0.
Next-generation sequencing (NGS) has become a common technology for clinical genetic tests. The quality of NGS calls varies widely and is influenced by features such as reference sequence characteristics, read depth, and mapping accuracy. With recent advances in NGS technology and software tools, the majority of variants called using NGS alone are accurate and reliable. However, a small subset of difficult-to-call variants still requires orthogonal confirmation. For this reason, many clinical laboratories confirm NGS results using orthogonal technologies such as Sanger sequencing. Here, we report the development of a deterministic machine-learning-based model to differentiate between two types of variant calls: those that do not require confirmation using an orthogonal technology (high confidence), and those that require additional quality testing (low confidence). This approach allows reliable NGS-based calling in a clinical setting by identifying the few important variant calls that require orthogonal confirmation.
We developed and tested the model using a set of 7179 variants identified by a targeted NGS panel and re-tested by Sanger sequencing. The model incorporated several signals of sequence characteristics and call quality to determine whether a variant was identified with high or low confidence. The model was tuned to eliminate false positives, defined as variants that were called by NGS but not confirmed by Sanger sequencing. The model achieved very high accuracy: 99.4% (95% confidence interval: +/- 0.03%). It categorized 92.2% (6622/7179) of the variants as high confidence, and 100% of these were confirmed to be present by Sanger sequencing. Among the variants categorized as low confidence, defined as low-quality NGS calls that are likely to be artifacts, 92.1% (513/557) were found to be absent by Sanger sequencing.
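The reported percentages can be cross-checked from the counts given above. The short Python sketch below does this. Its variable names are illustrative. The accuracy formula it uses (high-confidence calls confirmed by Sanger plus low-confidence calls not confirmed, divided by all variants) is an assumption about how the 99.4% figure was derived, since the abstract does not state the formula, although it does reproduce the reported value.

```python
# Counts taken directly from the abstract.
total_variants = 7179
high_conf = 6622                 # categorized as high confidence
low_conf = 557                   # categorized as low confidence
high_conf_confirmed = 6622       # 100% of high-confidence calls confirmed by Sanger
low_conf_not_confirmed = 513     # low-confidence calls absent by Sanger


def pct(numerator: int, denominator: int) -> float:
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# The two categories should partition the full variant set.
assert high_conf + low_conf == total_variants

high_conf_share = pct(high_conf, total_variants)
low_conf_artifact_share = pct(low_conf_not_confirmed, low_conf)

# Low-confidence calls that Sanger nevertheless confirmed as real variants.
low_conf_confirmed = low_conf - low_conf_not_confirmed

# ASSUMPTION: accuracy = correctly categorized calls / all calls.
accuracy = pct(high_conf_confirmed + low_conf_not_confirmed, total_variants)

print(f"High-confidence share:        {high_conf_share}%")
print(f"Low-confidence, not present:  {low_conf_artifact_share}%")
print(f"Low-confidence, confirmed:    {low_conf_confirmed}")
print(f"Accuracy (assumed formula):   {accuracy}%")
```

Under this reading, the 44 low-confidence calls that Sanger did confirm are the only calls counted against the model's accuracy. No unconfirmed call was placed in the high-confidence category, which is consistent with the model being tuned to eliminate false positives.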
This work shows that NGS data contains sufficient characteristics for a machine-learning-based model to differentiate low from high confidence variants. Additionally, it reveals the importance of incorporating site-specific features as well as variant call features in such a model.