Li Zhiyi, Wu Xiaowei, He Bin, Zhang Liqing
Department of Computer Science, Virginia Tech, Blacksburg, VA, 24061, USA.
Department of Statistics, Virginia Tech, Blacksburg, VA, 24061, USA.
BMC Bioinformatics. 2014 Nov 19;15(1):359. doi: 10.1186/s12859-014-0359-1.
With the advance of next-generation sequencing (NGS) technologies, a large number of insertion and deletion (indel) variants have been identified in human populations. Despite much research into variant calling, a non-negligible proportion of the identified indel variants may be false positives caused by sequencing errors, artifacts of ambiguous alignments, and annotation errors.
In this paper, we examine indel redundancy in dbSNP, one of the central databases for indel variants, and develop a standalone computational pipeline, dubbed Vindel, to detect redundant indels. The pipeline first uses indel position information to form candidate redundant groups, then applies each indel to the reference genome to generate the corresponding indel variant substring. Finally, the indel variant substrings within the same candidate redundant group are compared pairwise to identify redundant indels. We applied the pipeline to the human indels in dbSNP. It identified approximately 8% redundancy among insertions, 12% among deletions, and 10% overall for insertions and deletions combined. These numbers are largely consistent across all human autosomes. We also investigated the indel size distribution and the distribution of distances between adjacent indels to better understand the mechanisms that generate indel variants.
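The three steps described above (grouping by position, applying each indel to the reference, pairwise comparison) can be illustrated with a minimal Python sketch. This is not the Vindel implementation: the `Indel` record, the function names, the `max_gap` grouping threshold, and the rule of grouping only indels of the same type and length are assumptions made for illustration, since the abstract does not specify how candidate groups are formed.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Indel:
    name: str  # identifier (hypothetical values in the usage note)
    pos: int   # 0-based reference index where the change starts
    kind: str  # "ins" or "del"
    seq: str   # inserted bases, or the bases that are deleted


def span_end(indel):
    """Reference index just past the bases touched by the indel."""
    return indel.pos + len(indel.seq) if indel.kind == "del" else indel.pos


def apply_indel(ref, indel, start, end):
    """Variant substring obtained by applying one indel to ref[start:end].

    Assumes start <= indel.pos and span_end(indel) <= end. For deletions,
    the deleted bases are assumed to match the reference (not checked).
    """
    if indel.kind == "ins":
        return ref[start:indel.pos] + indel.seq + ref[indel.pos:end]
    return ref[start:indel.pos] + ref[indel.pos + len(indel.seq):end]


def candidate_groups(indels, max_gap):
    """Chain indels of the same kind and length whose sorted start
    positions are at most max_gap apart (an assumed grouping rule)."""
    by_type = defaultdict(list)
    for indel in indels:
        by_type[(indel.kind, len(indel.seq))].append(indel)
    groups = []
    for members in by_type.values():
        members.sort(key=lambda i: i.pos)
        current = [members[0]]
        for indel in members[1:]:
            if indel.pos - current[-1].pos <= max_gap:
                current.append(indel)
            else:
                groups.append(current)
                current = [indel]
        groups.append(current)
    return groups


def redundant_pairs(ref, indels, max_gap=10):
    """Name pairs of indels that yield identical variant substrings."""
    pairs = []
    for group in candidate_groups(indels, max_gap):
        for a, b in combinations(group, 2):
            # Compare over the smallest window covering both indels;
            # the reference outside this window is unchanged by either.
            start = min(a.pos, b.pos)
            end = max(span_end(a), span_end(b))
            if apply_indel(ref, a, start, end) == apply_indel(ref, b, start, end):
                pairs.append((a.name, b.name))
    return pairs
```

For example, on the toy reference `TTGCACACAGTT`, deleting `CA` at index 3, `AC` at index 4, or `CA` at index 7 all produce the same sequence, so `redundant_pairs` reports those three records as mutually redundant, whereas deleting `GT` at index 9 falls in the same candidate group but is not redundant with any of them.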
Vindel, a simple yet effective computational pipeline, can be used to check whether a set of indels is redundant with respect to those already in a database of interest, such as NCBI's dbSNP. Of the approximately 5.9 million indels we examined, nearly 0.6 million are redundant, revealing a serious limitation in current indel annotation. Statistical results show that the pipeline detects indel redundancy consistently across all 22 autosomes. In addition to the standalone Vindel pipeline, the indel redundancy check algorithm is implemented in a web server (http://bioinformatics.cs.vt.edu/zhanglab/indelRedundant.php).