National Institute of Biomedical Genomics, Kalyani, 741251, West Bengal, India.
Nucleic Acids Res. 2023 Aug 11;51(14):e75. doi: 10.1093/nar/gkad539.
High-throughput sequencing (HTS) has revolutionized science by enabling rapid detection of genomic variants at base-pair resolution. Consequently, it poses the challenging problem of identifying technical artifacts, i.e. hidden non-random error patterns. Understanding the properties of sequencing artifacts is key to separating true variants from false positives. Here, we develop Mapinsights, a toolkit that performs quality control (QC) analysis of sequence alignment files and detects outliers based on sequencing artifacts in HTS data at a deeper resolution than existing methods. For outlier detection, Mapinsights performs a cluster analysis on novel and existing QC features derived from the sequence alignment. We applied Mapinsights to community-standard open-source datasets and identified various quality issues, including technical errors related to sequencing cycles, sequencing chemistry and sequencing libraries, across various orthogonal sequencing platforms. Mapinsights also enables identification of anomalies related to sequencing depth. A logistic regression-based model built on the features of Mapinsights shows high accuracy in detecting 'low-confidence' variant sites. Quantitative estimates and probabilistic arguments provided by Mapinsights can be utilized in identifying errors, bias and outlier samples, and can also aid in improving the authenticity of variant calls.
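To illustrate the general idea of flagging outlier samples from alignment-derived QC features, the following is a minimal sketch only. It is not the Mapinsights implementation: it swaps the cluster analysis described above for a simpler robust z-score screen based on the median absolute deviation (MAD). The feature names (`mismatch_rate`, `mean_depth`), the sample values and the 3.5 threshold are hypothetical and chosen purely for illustration.

```python
from statistics import median


def robust_z_scores(values):
    """Return MAD-based robust z-scores for a list of numbers."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: treat every value as unremarkable.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores
    # under a normal distribution.
    return [0.6745 * (v - med) / mad for v in values]


def flag_outlier_samples(features, threshold=3.5):
    """Flag samples whose robust z-score exceeds the threshold on any feature.

    features: dict mapping sample name -> dict of feature name -> value.
    Assumes every sample reports every feature.
    """
    samples = sorted(features)
    feature_names = sorted({name for s in samples for name in features[s]})
    flagged = set()
    for name in feature_names:
        column = [features[s][name] for s in samples]
        for sample, z in zip(samples, robust_z_scores(column)):
            if abs(z) > threshold:
                flagged.add(sample)
    return sorted(flagged)


# Hypothetical per-sample QC features (illustrative values, not real data).
qc = {
    "sample_A": {"mismatch_rate": 0.0040, "mean_depth": 30.1},
    "sample_B": {"mismatch_rate": 0.0042, "mean_depth": 29.5},
    "sample_C": {"mismatch_rate": 0.0039, "mean_depth": 31.0},
    "sample_D": {"mismatch_rate": 0.0041, "mean_depth": 30.4},
    "sample_E": {"mismatch_rate": 0.0150, "mean_depth": 30.0},
}

outliers = flag_outlier_samples(qc)
print(outliers)
```

In this toy input only `sample_E` stands out, because its mismatch rate lies far from the cohort median while its depth is typical. A per-feature screen like this cannot capture joint structure across features, which is one reason a multivariate cluster analysis may be preferable in practice.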