Akogwu Isaac, Wang Nan, Zhang Chaoyang, Gong Ping
School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA.
Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
Hum Genomics. 2016 Jul 25;10 Suppl 2(Suppl 2):20. doi: 10.1186/s40246-016-0068-0.
Advances in high-throughput next-generation sequencing (NGS) have opened up numerous opportunities for new genomic research. However, a pitfall of this abundance of NGS data is the difficulty of distinguishing true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but an independent evaluation of how dataset features such as read length, genome size, and coverage depth affect their performance is lacking. This comparative study investigates the strengths, weaknesses, and limitations of several recent k-spectrum-based methods and provides recommendations to help users select suitable methods for specific NGS datasets.
Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). The Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method.
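The metrics listed above can be illustrated with a minimal sketch. It assumes the definitions commonly used in the read error correction literature (recall = TP/(TP+FN), precision = TP/(TP+FP), gain = (TP−FP)/(TP+FN), F-score = harmonic mean of precision and recall); the exact conventions implemented in ECET may differ in detail. The class `CorrectionCounts`, the function `correction_metrics`, and the counts in the usage example are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionCounts:
    """Base-level outcome counts for one correction run (hypothetical container)."""
    tp: int  # erroneous bases that were corrected to the true base
    fp: int  # originally correct bases that were wrongly changed
    fn: int  # erroneous bases that were left uncorrected


def correction_metrics(c: CorrectionCounts) -> dict:
    """Derive recall, precision, gain, and F-score from TP/FP/FN counts.

    Assumed definitions (standard in the literature, not confirmed for ECET):
      recall    = TP / (TP + FN)
      precision = TP / (TP + FP)
      gain      = (TP - FP) / (TP + FN)
      F-score   = 2 * precision * recall / (precision + recall)
    """
    errors = c.tp + c.fn    # all bases that were truly erroneous
    changed = c.tp + c.fp   # all bases the corrector modified
    recall = c.tp / errors if errors else 0.0
    precision = c.tp / changed if changed else 0.0
    gain = (c.tp - c.fp) / errors if errors else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "gain": gain,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    example = CorrectionCounts(tp=80, fp=10, fn=20)
    for name, value in correction_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions, gain can be negative when a method introduces more errors than it removes (FP > TP), which is why it is reported alongside the F-score rather than instead of it.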
Results from computational experiments indicate that Musket had the best overall performance across the range of dataset characteristics examined in the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred on a dataset with a medium read length (56 bp), a medium coverage depth (50×), and a small genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets.
This study demonstrates that factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods. Thus, care must be taken in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six test datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diverse parameter settings (k-mer values and edit distances) and to compare them against other, non-k-spectrum-based classes of error correction methods.