Kuo Tony, Frith Martin C, Sese Jun, Horton Paul
Artificial Intelligence Research Center, AIST, 2-3-26 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
AIST-Tokyo Tech RWBC-OIL, 2-12-1 Okayama, Meguro-ku, Tokyo, 152-8550, Japan.
BMC Med Genomics. 2018 Apr 20;11(Suppl 2):28. doi: 10.1186/s12920-018-0342-1.
Reliable detection of genome variations, especially insertions and deletions (indels), from single-sample DNA sequencing data remains challenging, partly because of the inherent uncertainty in aligning sequencing reads to the reference genome. In practice, a variety of ad hoc quality filtering methods are employed to produce more reliable lists of putative variants, but the resulting lists typically still include numerous false positives. It would therefore be desirable to rigorously evaluate the degree to which each putative variant is supported by the data. Unfortunately, users who wish to do this, e.g. to prioritize validation experiments, have had limited options.
Here we present EAGLE, a method for evaluating the degree to which sequencing data supports a given candidate genome variant. EAGLE incorporates candidate variants into explicit hypotheses about the individual's genome, and then computes the probability of the observed data (the sequencing reads) under each hypothesis. In contrast to methods that rely heavily on one particular alignment of the reads to the reference genome, EAGLE readily accounts for uncertainties that may arise from multi-mapping or local misalignment, and it uses the entire length of each read. We compared EAGLE scores with the scores assigned by several well-known variant callers on the task of ranking true variants among putative variants, using both simulated data and benchmarks based on real genome sequencing. For indels, EAGLE obtained marked improvement on simulated data and on a whole genome sequencing benchmark, and modest but statistically significant improvement on an exome sequencing benchmark.
EAGLE ranked true variants higher than did the scores reported by the callers, and it can be used to improve specificity in variant calling. EAGLE is freely available at https://github.com/tony-kuo/eagle.
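The hypothesis-based scoring described above can be illustrated with a toy sketch. This is not EAGLE's implementation: the function names, the constant per-base error rate, the restriction to ungapped read placements, the uniform prior over placements, and the homozygous-only hypotheses are all simplifying assumptions made for illustration. The sketch builds an alternative genome sequence containing a candidate deletion, computes the log probability of the reads under the reference and alternative hypotheses by summing over every possible placement of each read (rather than trusting one alignment), and reports the log likelihood ratio.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def read_log_prob(read, haplotype, error_rate=0.01):
    """log P(read | haplotype) under a toy model (an assumption, not EAGLE's).

    Each base matches with probability 1 - error_rate and is any one of the
    three other bases with probability error_rate / 3. The read may start at
    any ungapped position, with a uniform prior over start positions, and the
    probabilities of all placements are summed.
    """
    n_starts = len(haplotype) - len(read) + 1
    if n_starts < 1:
        raise ValueError("read is longer than the haplotype")
    log_match = math.log(1.0 - error_rate)
    log_mismatch = math.log(error_rate / 3.0)
    per_start = []
    for start in range(n_starts):
        window = haplotype[start:start + len(read)]
        per_start.append(
            sum(log_match if r == h else log_mismatch
                for r, h in zip(read, window))
        )
    return log_sum_exp(per_start) - math.log(n_starts)


def apply_deletion(reference, pos, length):
    """Return the sequence with `length` bases removed starting at 0-based `pos`."""
    return reference[:pos] + reference[pos + length:]


def log_likelihood_ratio(reads, reference, alternative, error_rate=0.01):
    """log P(reads | alternative) - log P(reads | reference), reads independent."""
    log_alt = sum(read_log_prob(r, alternative, error_rate) for r in reads)
    log_ref = sum(read_log_prob(r, reference, error_rate) for r in reads)
    return log_alt - log_ref


# Toy usage: a 4-base candidate deletion and two reads spanning its junction.
reference = "ACGTTGCAAGGCTTACCGATGCATTGACCA"
alternative = apply_deletion(reference, pos=12, length=4)
supporting_reads = [alternative[6:18], alternative[4:16]]
score = log_likelihood_ratio(supporting_reads, reference, alternative)
```

A positive score means the reads are more probable under the hypothesis that includes the candidate variant, so candidates can be ranked by this value. Because every placement contributes to the sum, a read that fits several locations about equally well adds little evidence either way, which is the intuition behind accounting for multi-mapping.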