Prezza Nicola, Pisanti Nadia, Sciortino Marinella, Rosone Giovanna
1 Dipartimento di Informatica, University of Pisa, Pisa, Italy.
3 ERABLE Team, INRIA, Lyon, France.
Algorithms Mol Biol. 2019 Feb 6;14:3. doi: 10.1186/s13015-019-0137-8. eCollection 2019.
Sequencing technologies keep getting cheaper and faster, which puts growing pressure on data structures designed to store raw data efficiently and, where possible, to support analysis directly on it. As a result, there is growing interest in alignment-free and reference-free variant calling methods that use only (suitably indexed) raw read data.
We develop a theory that (i) describes how the extended Burrows-Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position, (ii) predicts the size of such clusters, and (iii) presents an elegant and precise LCP-array-based procedure to locate such clusters in the eBWT. Based on this theory, we designed and implemented an alignment-free and reference-free SNP calling method, and built an SNP calling pipeline around it. Experiments on both synthetic and real data show that SNPs can be detected with a simple scan of the eBWT and LCP arrays because, in accordance with our theoretical framework, they fall within clusters in the eBWT of the reads. Finally, our tool intrinsically performs a reference-free evaluation of its accuracy by returning the coverage of each SNP.
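To make the clustering idea concrete, the following is a minimal Python sketch and not the ebwt2snp implementation. It assumes a naive construction (sorting all read suffixes with a `$` end-marker), a simple LCP threshold `k` to delimit clusters, and a minimum per-base count `min_cov`; the function names and both parameters are hypothetical choices made for illustration, and the actual cluster-location procedure in the paper may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple


def sorted_suffixes(reads: List[str]) -> List[Tuple[str, str]]:
    """Return (suffix, preceding_char) pairs in lexicographic suffix order.

    Each read gets a '$' end-marker; ties between equal suffixes are
    broken by read index. Naive construction, for illustration only.
    """
    entries = []
    for r_idx, read in enumerate(reads):
        text = read + "$"
        for i in range(len(text)):
            # text[i - 1] wraps to '$' for i == 0 (cyclic predecessor).
            entries.append((text[i:], r_idx, text[i - 1]))
    entries.sort(key=lambda e: (e[0], e[1]))
    return [(suffix, prev) for suffix, _, prev in entries]


def lcp_len(a: str, b: str) -> int:
    """Length of the common prefix of a and b, stopping at an end-marker."""
    n = 0
    for x, y in zip(a, b):
        if x != y or x == "$":
            break
        n += 1
    return n


def find_candidate_snps(
    reads: List[str], k: int = 4, min_cov: int = 2
) -> List[Dict[str, int]]:
    """Scan the transform and LCP array; report clusters with >= 2 frequent bases.

    A cluster here is a maximal run of suffixes whose consecutive LCP
    values are all >= k (an assumed, simplified criterion).
    """
    sa = sorted_suffixes(reads)
    n = len(sa)
    lcp = [0] * n
    for i in range(1, n):
        lcp[i] = lcp_len(sa[i - 1][0], sa[i][0])

    candidates = []
    start = 0
    for i in range(1, n + 1):
        if i == n or lcp[i] < k:
            # Positions start..i-1 share a right context of length >= k.
            if i - start > 1:
                counts = Counter(p for _, p in sa[start:i] if p != "$")
                frequent = {b: c for b, c in counts.items() if c >= min_cov}
                if len(frequent) >= 2:
                    candidates.append(frequent)
            start = i
    return candidates
```

For example, calling `find_candidate_snps(["GATTACA", "GATTACA", "GACTACA", "GACTACA"], k=4, min_cov=2)` groups the four suffixes that start with `TACA` into one cluster, whose preceding bases are two `T` and two `C`. The per-base counts play the role of the coverage that the tool reports for each SNP.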
Based on the results of the experiments on synthetic and real data, we conclude that the positional clustering framework can be used effectively to identify SNPs, and that it appears to be a promising approach for calling other types of variants directly on raw sequencing data.
The software ebwt2snp is freely available for academic use at: https://github.com/nicolaprezza/ebwt2snp.