Jiang Hongshan, Lei Rong, Ding Shou-Wei, Zhu Shuifang
Institute of Plant Quarantine Research, Chinese Academy of Inspection and Quarantine, Huixinli 241, Beijing, 100029 China.
BMC Bioinformatics. 2014 Jun 12;15:182. doi: 10.1186/1471-2105-15-182.
Adapter trimming is a prerequisite step for analyzing next-generation sequencing (NGS) data when the reads are longer than the target DNA/RNA fragments. Although typically used in small RNA sequencing, adapter trimming is also widely used in other applications, such as genomic DNA sequencing and transcriptome RNA/cDNA sequencing, where fragments shorter than a read are sometimes obtained because of the limitations of NGS protocols. For the newly emerged Nextera long mate-pair (LMP) protocol, junction adapters are located in the middle of all properly constructed fragments; hence, adapter trimming is essential to obtain the correct paired reads. However, our investigations have shown that few adapter trimming tools meet both efficiency and accuracy requirements simultaneously. The performance of these tools can be even worse for paired-end and/or mate-pair sequencing.
To improve the efficiency of adapter trimming, we devised a novel algorithm, the bit-masked k-difference matching algorithm, which runs in O(kn) expected time and O(m) space, where k is the maximum number of differences allowed, n is the read length, and m is the adapter length. This algorithm makes it possible to fully enumerate all candidates that meet a specified threshold, e.g. error ratio, within a short period of time. To improve the accuracy of this algorithm, we designed a simple and easy-to-explain statistical scoring scheme to evaluate candidates in the pattern matching step. We also devised scoring schemes to fully exploit the paired-end/mate-pair information when it is applicable. All these features have been implemented in an industry-standard tool named Skewer (https://sourceforge.net/projects/skewer). Experiments on simulated data and on real data from small RNA sequencing, paired-end RNA sequencing, and Nextera LMP sequencing showed that Skewer outperforms all other similar tools that have the same utility. Further, Skewer is considerably faster than other tools of comparable accuracy: roughly twice as fast for single-end sequencing, more than 12 times faster for paired-end sequencing, and 49% faster for LMP sequencing.
Skewer achieved as-yet unmatched accuracy for adapter trimming with a low time bound.
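To make the k-difference matching problem concrete, the following is a minimal sketch of the classic dynamic-programming formulation (Sellers' algorithm), which enumerates every read position where the adapter ends with at most k differences. It is not the bit-masked algorithm described in the abstract: this textbook version takes O(mn) time rather than O(kn) expected time, although it shares the O(m) space bound. The function name `k_difference_matches` and its return format are assumptions made for illustration and do not come from Skewer.

```python
from typing import List, Tuple


def k_difference_matches(adapter: str, read: str, k: int) -> List[Tuple[int, int]]:
    """Return (end, distance) pairs for every 1-based read position `end`
    at which the whole adapter matches some substring of the read ending
    there with at most k differences (substitutions, insertions, deletions).

    Hypothetical helper: plain Sellers dynamic programming, not Skewer's
    bit-masked algorithm.
    """
    m = len(adapter)
    # col[i] holds the minimum edit distance between adapter[:i] and any
    # substring of the read that ends at the current read position.
    col = list(range(m + 1))
    hits = []
    for end, base in enumerate(read, start=1):
        prev_diag = col[0]  # value for (i-1, end-1)
        col[0] = 0          # a match may start at any read position
        for i in range(1, m + 1):
            cost = 0 if adapter[i - 1] == base else 1
            best = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra base in the read
                col[i - 1] + 1,    # adapter base missing from the read
            )
            prev_diag = col[i]
            col[i] = best
        if col[m] <= k:
            hits.append((end, col[m]))
    return hits


if __name__ == "__main__":
    # Exact occurrence of the adapter ending at read position 6.
    print(k_difference_matches("ACGT", "TTACGTTT", 0))  # [(6, 0)]
```

Only a single column of m + 1 integers is kept, which is where the O(m) space bound comes from. A trimming tool would additionally need to score the candidates and recover the start of the chosen match, which this sketch leaves out.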