Shanghai Key Lab of Intelligent Information Processing, Shanghai, China.
School of Computer Science and Technology, Fudan University, Shanghai, China.
BMC Bioinformatics. 2018 Apr 19;19(1):145. doi: 10.1186/s12859-018-2147-9.
The rapid development of next-generation sequencing (NGS) technology has steadily increased the throughput of sequencing data. However, for lack of a tool that is both fast and accurate, analyzing NGS data, especially low-coverage data, remains challenging.
We propose a decision-tree-based variant calling algorithm. Experiments on a set of real data indicate that the algorithm achieves high accuracy and sensitivity for SNVs and indels and adapts well to low-coverage data. In particular, it was clearly faster than three widely used tools in our experiments.
We implemented the algorithm in a software tool named Fuwa and applied it, together with four well-known variant callers (Platypus, GATK-UnifiedGenotyper, GATK-HaplotypeCaller and SAMtools), to three sequencing data sets from the well-studied sample NA12878, produced by whole-genome sequencing (WGS), whole-exome sequencing and low-coverage whole-genome sequencing, respectively. We also ran additional experiments on WGS data from four newly released samples that have not been used to populate dbSNP.