Tran Quang, Gao Shanshan, Phan Vinhthuy
Department of Computer Science, University of Memphis, Memphis, TN 38152, USA.
BMC Bioinformatics. 2016 Oct 6;17(Suppl 13):349. doi: 10.1186/s12859-016-1216-1.
Efforts such as the International HapMap Project and the 1000 Genomes Project have produced a catalog of millions of single-nucleotide and insertion/deletion (INDEL) variants in the human population. Viewed as a reference of existing variants, this resource commonly serves as a gold standard for studying and developing methods to detect genetic variants. Our analysis revealed that this reference contains thousands of INDELs that were constructed in a biased manner. This bias arises at the step where short reads are aligned to reference genomes to detect variants. It is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that they can be divided into groups whose alignments yield INDELs that either agree strongly or disagree strongly with the reported INDELs. This finding suggests that agreement or disagreement between an aligner's called INDEL and the reported INDEL is merely a result of the arbitrary selection of one of the optimal alignments. The existence of bias in INDEL calling might seriously affect downstream analyses. As such, our finding suggests that this phenomenon should be further addressed.
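To make the idea of "many theoretically optimal alignments" concrete, the following minimal Python sketch enumerates every position at which a deletion of a given length produces the same alternative sequence. It is an illustration only, not the method used in the paper: the function name `equivalent_deletions` and the toy reference sequence are assumptions made up for this example. Inside a tandem repeat, every placement it returns explains a read carrying the deletion equally well, so the position an aligner reports is an arbitrary choice among them.

```python
from __future__ import annotations


def equivalent_deletions(ref: str, pos: int, length: int) -> list[int]:
    """Return every 0-based start position where deleting `length` bases
    from `ref` yields the same sequence as deleting them at `pos`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Alternative allele sequence produced by the deletion as given.
    alt = ref[:pos] + ref[pos + length:]
    # Brute-force scan: try the same-length deletion at every position
    # and keep the positions that reproduce the identical sequence.
    return [
        p
        for p in range(len(ref) - length + 1)
        if ref[:p] + ref[p + length:] == alt
    ]


# Toy reference (made up): a CA dinucleotide repeat flanked by GG and TT.
ref = "GGCACACACATT"

# A 2-base deletion reported at position 4 can be placed at several other
# positions without changing the resulting sequence.
placements = equivalent_deletions(ref, pos=4, length=2)
leftmost, rightmost = min(placements), max(placements)
print(placements, leftmost, rightmost)
```

In this toy case an aligner that prefers the left-most placement and one that prefers the right-most placement would report the same deletion at different coordinates, which is the kind of disagreement the abstract describes.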