Severance Biomedical Science Institute, Brain Korea 21 PLUS Project for Medical Sciences, Yonsei University College of Medicine, Seoul 03722, South Korea.
Graduate School of Medical Science and Engineering, KAIST, Daejeon 34141, South Korea.
Bioinformatics. 2016 Oct 15;32(20):3072-3080. doi: 10.1093/bioinformatics/btw383. Epub 2016 Jun 22.
Advances in sequencing technologies have markedly lowered the detection limit for somatic variants, so that low-frequency variants can now be detected. However, calling mutations in this range is still confounded by many factors, including environmental contamination. Vector contamination is a recurring issue and is especially problematic because vector inserts are hardly distinguishable from sample sequences. Such inserts, which may harbor polymorphisms and engineered functional mutations, can cause false variant calls at the corresponding sites. Numerous vector-screening methods have been developed, but none can handle contamination from inserts because they focus on vector backbone sequences alone.
We developed a novel method, Vecuum, that identifies vector-originated reads and the resulting false variants. Since vector inserts are generally constructed from intron-less cDNAs, Vecuum identifies vector-originated reads by inspecting the clipping patterns at exon junctions. False variant calls are then detected based on the biased distribution of mutant alleles toward vector-originated reads. Tests on simulated and spike-in experimental data validated that Vecuum could detect 93% of vector contaminants and could remove up to 87% of variant-like false calls with 100% precision. Application to public sequence datasets demonstrated the utility of Vecuum in detecting false variants resulting from various types of external contamination.
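The two ideas described above (clipping at exon junctions, and mutant alleles concentrating in vector-originated reads) can be illustrated with a minimal Python sketch. This is not the Vecuum implementation: the function names, the exon list format (1-based inclusive `(start, end)` pairs), the exact-boundary matching rule, and the use of a one-sided Fisher's exact test as the bias measure are all assumptions made for illustration.

```python
import re
from math import comb

# SAM CIGAR operations: length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_positions(pos, cigar):
    """Return (side, ref_pos) for each soft clip in a SAM-style alignment.

    pos is the 1-based leftmost aligned reference position.
    A left clip ("L") reports the first aligned base; a right clip ("R")
    reports the last aligned base.
    """
    clips = []
    ref = pos             # next reference position to be consumed
    seen_aligned = False  # True once any reference-consuming op was seen
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "S":
            clips.append(("R", ref - 1) if seen_aligned else ("L", pos))
        elif op in "MDN=X":  # operations that consume the reference
            ref += n
            seen_aligned = True
    return clips


def is_vector_like(pos, cigar, exons):
    """Hypothetical rule: a read looks vector-originated if a soft clip
    falls exactly on an exon boundary, as expected when an intron-less
    cDNA read is aligned to the genome without spliced alignment."""
    starts = {s for s, _ in exons}
    ends = {e for _, e in exons}
    for side, p in clip_positions(pos, cigar):
        if (side == "L" and p in starts) or (side == "R" and p in ends):
            return True
    return False


def mutant_bias_pvalue(mut_vec, mut_other, ref_vec, ref_other):
    """One-sided Fisher's exact test (assumed here as the bias measure):
    probability of seeing at least mut_vec mutant alleles among the
    vector-like reads if alleles were distributed independently of
    read origin."""
    total = mut_vec + mut_other + ref_vec + ref_other
    n_mut = mut_vec + mut_other
    n_vec = mut_vec + ref_vec
    upper = min(n_mut, n_vec)
    tail = sum(
        comb(n_mut, k) * comb(total - n_mut, n_vec - k)
        for k in range(mut_vec, upper + 1)
    )
    return tail / comb(total, n_vec)
```

Under these assumptions, a site where every mutant allele sits in a junction-clipped read (for example `mutant_bias_pvalue(5, 0, 0, 5)`) yields a small p-value and would be flagged as a likely false call, whereas mutant alleles spread evenly across both read groups would not.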
A Java-based implementation of the method is available at http://vecuum.sourceforge.net/
CONTACT: swkim@yuhs.ac
Supplementary information: Supplementary data are available at Bioinformatics online.