Ni Shengyu, Stoneking Mark
Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, D04103, Germany.
BMC Genomics. 2016 Feb 27;17:139. doi: 10.1186/s12864-016-2463-2.
Detecting minor alleles in very high coverage sequence data (>1000X) has many applications, such as detecting mtDNA heteroplasmy, identifying somatic mutations in cancer, and calling SNPs in pooled sequencing. In these settings, low-frequency reads are not necessarily sequencing errors and may instead carry biological information. However, the suitability of common base quality recalibration tools for such applications has not been investigated in detail.
We show that the widely used tool GATK BaseRecalibration has several limitations for minor allele detection. First, GATK IndelRealignment fails once sequence coverage exceeds a certain level, because it becomes computationally infeasible. Second, the accuracy of the recalibrated base qualities depends largely on the database of known SNPs used as the control, which limits de novo minor allele detection. Third, GATK lowers the base quality of sequencing errors at the cost of also lowering the scores of true minor alleles. To overcome these limitations, we present a novel approach called SEGREG, which applies segmented regression to control sequences (e.g. phiX174 DNA) spiked into a sequencing run. In simulations, SEGREG improves both the accuracy of base quality scores and the detection of minor alleles. We further investigate sequencing error and recalibration parameters by applying a log-likelihood ratio (LLR) approach to SEGREG-recalibrated base quality scores, both for phiX174 DNA sequenced to very high coverage and for mtDNA genome sequences previously analyzed for heteroplasmic variants.
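The idea of recalibrating against a spiked-in control can be illustrated with a minimal sketch. It assumes that mismatches against the known control reference (e.g. phiX174) have already been counted per reported quality score, and it fits a continuous two-segment linear model with a single breakpoint found by grid search. The function names, the pseudocounts, and the single-breakpoint model are illustrative assumptions, not the SEGREG implementation, whose exact model is not described in this abstract.

```python
import numpy as np

def empirical_quality(mismatches, totals):
    """Phred-scaled empirical error rate per reported-quality bin.

    Pseudocounts (+1 / +2) are an assumption that keeps bins without
    mismatches finite.
    """
    mismatches = np.asarray(mismatches, dtype=float)
    totals = np.asarray(totals, dtype=float)
    return -10.0 * np.log10((mismatches + 1.0) / (totals + 2.0))

def fit_segmented(x, y, weights=None):
    """Fit y = a + b*x + c*max(0, x - bp) with one breakpoint bp.

    The breakpoint is chosen by grid search over the interior distinct
    values of x; for each candidate the coefficients come from weighted
    least squares. Returns (bp, coef) with coef = [a, b, c].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    sw = np.sqrt(w)
    best = None
    for bp in np.unique(x)[1:-1]:
        design = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - bp)])
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        sse = float(np.sum(w * (design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(bp), coef)
    if best is None:
        raise ValueError("need at least three distinct reported qualities")
    return best[1], best[2]

def recalibrate(q, bp, coef):
    """Map reported quality scores to recalibrated scores."""
    q = np.asarray(q, dtype=float)
    return coef[0] + coef[1] * q + coef[2] * np.maximum(0.0, q - bp)
```

In use, `x` would hold the reported quality scores seen in the control reads, `y` the output of `empirical_quality` for each score, and `weights` could be the number of bases per score; the fitted mapping would then be applied to the sample reads from the same run.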
Our results suggest that SEGREG improves base quality recalibration without the limitations discussed above, and that the LLR approach benefits from SEGREG by identifying more true minor alleles while avoiding false positives caused by sequencing error.
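To show why base quality scores matter for a likelihood-ratio test, the following is a generic sketch of an LLR for a minor allele at a single site. It compares the hypothesis that all minor-allele reads are errors (frequency 0) with the best-fitting minor allele frequency, treating each read's error probability as given by its Phred score and assuming errors are spread evenly over the three other bases. This formulation, the grid search, and all names are assumptions for illustration; the abstract does not specify the authors' LLR.

```python
import math

def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10.0 ** (-q / 10.0)

def log_likelihood(f, is_minor, quals):
    """Log-likelihood of the reads at one site for minor allele frequency f.

    Assumes every quality score is > 0 so that no probability is zero.
    """
    total = 0.0
    for minor, q in zip(is_minor, quals):
        e = phred_to_prob(q)
        if minor:
            # true minor base read correctly, or major base misread as minor
            p = f * (1.0 - e) + (1.0 - f) * e / 3.0
        else:
            # true major base read correctly, or minor base misread as major
            p = (1.0 - f) * (1.0 - e) + f * e / 3.0
        total += math.log(p)
    return total

def minor_allele_llr(is_minor, quals, grid_size=1000):
    """Return (LLR, f_hat): best log-likelihood over f in (0, 0.5] minus that at f = 0."""
    null = log_likelihood(0.0, is_minor, quals)
    best_f, best_ll = 0.0, null
    for k in range(1, grid_size // 2 + 1):
        f = k / grid_size
        ll = log_likelihood(f, is_minor, quals)
        if ll > best_ll:
            best_f, best_ll = f, ll
    return best_ll - null, best_f
```

Under this sketch, lowering the quality of reads that carry the minor allele lowers the LLR, which is consistent with the point that a recalibration penalizing true minor alleles makes them harder to detect.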