Department of Computer Science and Engineering, Michigan State University, East Lansing, 48824, MI, USA.
Electronic Engineering Department, City University of Hong Kong, Hong Kong SAR, China.
BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.
Single-molecule, real-time (SMRT) sequencing, developed by Pacific Biosciences (PacBio), produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assemblies, reveal structural variations, and characterize intra-species variation. It also holds promise for deciphering the structure of complex microbial communities, because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads that overlap. Because PacBio data have a higher sequencing error rate and lower coverage than popular short-read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps and overlaps in which both reads have high error rates. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies.
In this work, we designed and implemented GroupK, an overlap detection program for third-generation sequencing reads based on grouped k-mer hits. While several existing programs use k-mer hits to detect overlaps between reads, our method uses a group of short k-mer hits that satisfy statistically derived distance constraints to increase the sensitivity of small-overlap detection. Grouped k-mer hits were originally designed for homology search; we are the first to apply group hits to long-read overlap detection. Experiments applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets with low sequencing coverage.
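The grouped k-mer hit idea can be illustrated with a minimal Python sketch. This is not the GroupK implementation: the function names (`kmer_hits`, `group_hits`, `best_group`) are hypothetical, and the thresholds `max_gap`, `max_diag_shift`, and `min_hits` are arbitrary placeholders standing in for the statistically derived distance constraints, which the abstract does not specify. The sketch only shows the general pattern of collecting short exact k-mer matches between two reads and keeping a group whose members lie close together and on nearly the same diagonal.

```python
from collections import defaultdict


def kmer_hits(a, b, k):
    """Return sorted (i, j) pairs such that a[i:i+k] == b[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    hits = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):
            hits.append((i, j))
    return sorted(hits)


def group_hits(hits, max_gap, max_diag_shift):
    """Greedily chain hits into groups.

    A hit joins the first group whose last hit precedes it in both reads
    by at most max_gap positions and whose diagonal (i - j) differs by at
    most max_diag_shift, which tolerates a few insertions or deletions.
    Both thresholds are illustrative assumptions, not GroupK's values.
    """
    groups = []
    for i, j in hits:
        for group in groups:
            pi, pj = group[-1]
            close = 0 < i - pi <= max_gap and 0 < j - pj <= max_gap
            same_diag = abs((i - j) - (pi - pj)) <= max_diag_shift
            if close and same_diag:
                group.append((i, j))
                break
        else:
            groups.append([(i, j)])
    return groups


def best_group(a, b, k=4, max_gap=20, max_diag_shift=3, min_hits=3):
    """Return the largest group of hits, or [] if it has fewer than min_hits."""
    groups = group_hits(kmer_hits(a, b, k), max_gap, max_diag_shift)
    best = max(groups, key=len, default=[])
    return best if len(best) >= min_hits else []
```

Calling `best_group(read_a, read_b)` on two reads returns the positions of the supporting k-mer hits, from which a candidate overlap region could be estimated. A greedy chain is used here for brevity; a real overlapper would more likely index all reads at once and score candidate groups statistically.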
GroupK is best used for detecting small overlaps in third-generation sequencing data. It is a useful supplement to existing tools for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK .