Hasan Mohammad Shabbir, Wu Xiaowei, Watson Layne T, Zhang Liqing
Department of Computer Science, Virginia Tech, Blacksburg, VA, 24061, USA.
Department of Statistics, Virginia Tech, Blacksburg, VA, 24061, USA.
Sci Rep. 2017 Oct 26;7(1):14106. doi: 10.1038/s41598-017-14400-1.
Storing biologically equivalent indels as distinct database entries creates redundancy and misleads downstream analysis. A unified system for identifying and representing equivalent indels is therefore desirable, both to remove this redundancy and to compare the indel calls produced by different tools. This paper describes UPS-indel, a utility that creates a universal positioning system for indels, so that equivalent indels are uniquely determined by their coordinates in the new system; the same coordinates can be used to compare indel call sets. Across all human chromosomes, UPS-indel identifies 15% of indels as redundant in dbSNP, 29% in COSMIC coding, and 13% in COSMIC noncoding datasets, higher than previously reported. Compared with the existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants, UPS-indel identifies 456,352 more redundant indels in dbSNP, 2,118 more in COSMIC coding, and 553 more in the COSMIC noncoding indel dataset, beyond those reported jointly by these tools. Moreover, comparing UPS-indel with state-of-the-art approaches for indel call set comparison demonstrates its clear superiority in finding common indels among call sets. UPS-indel is theoretically proven to find all equivalent indels and is thus exhaustive.
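To make the notion of equivalent indels concrete, the sketch below shows how deletions recorded at different positions inside a repeat yield the same sequence, and how a position-independent key can be derived from the range over which a deletion can shift. This is only an illustration under simplifying assumptions (deletions only, 0-based coordinates, a single short reference string); it is not the UPS-indel algorithm or its coordinate definition, and the function names `apply_deletion` and `deletion_shift_range` are hypothetical.

```python
from typing import Tuple


def apply_deletion(ref: str, pos: int, length: int) -> str:
    """Return ref with `length` bases removed, starting at 0-based index `pos`."""
    return ref[:pos] + ref[pos + length:]


def deletion_shift_range(ref: str, pos: int, length: int) -> Tuple[int, int]:
    """Return the leftmost and rightmost start positions of deletions
    that produce the same sequence as deleting ref[pos:pos + length].

    Illustrative assumption: two deletions of equal length are treated as
    equivalent exactly when they produce identical resulting sequences.
    """
    left = pos
    # Moving the window one base left changes nothing when the base that
    # enters the window equals the base that leaves it.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = pos
    # Symmetric check for moving the window one base right.
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


ref = "GACACACT"
# Deleting two bases at positions 1, 2, 3, or 5 gives the same sequence.
results = {apply_deletion(ref, p, 2) for p in (1, 2, 3, 5)}
# Both records map to the same shift range, so the range can serve as a key.
key_a = deletion_shift_range(ref, 1, 2)
key_b = deletion_shift_range(ref, 3, 2)
```

Here the shift range plays the role of a shared coordinate: two deletion records with the same length and the same range describe the same event, however each tool happened to position it.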