Center for Bioinformatics, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China.
Bioinformatics. 2016 Jun 1;32(11):1625-31. doi: 10.1093/bioinformatics/btv662. Epub 2015 Nov 14.
Single Molecule Real-Time (SMRT) sequencing has been widely applied in cutting-edge genomic studies. However, aligning noisy long SMRT reads to a reference genome with state-of-the-art aligners remains computationally expensive, which is becoming a bottleneck in applications of SMRT sequencing. A novel approach is needed to improve the efficiency and effectiveness of SMRT read alignment.
We propose the Regional Hashing-based Alignment Tool (rHAT), a seed-and-extension-based read alignment approach specifically designed for noisy long reads. rHAT indexes the reference genome with a regional hash table (RHT), a hash table-based index that describes the short tokens within local windows of the reference genome. In the seeding phase, rHAT uses the RHT to efficiently count the short token matches between a part of the read and local genomic windows, in order to find the most likely candidate sites. In the extension phase, a sparse dynamic programming-based heuristic is used to reduce the cost of aligning the read to the candidate sites. By benchmarking on real and simulated datasets from various prokaryotic and eukaryotic genomes, we demonstrate that rHAT can effectively align SMRT reads with outstanding throughput.
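The seeding idea described above can be illustrated with a minimal Python sketch. It is not rHAT's C++ implementation: the function names, the use of half-overlapping windows, the set-based storage, and the toy values of `k`, `window` and `step` are all assumptions made for illustration. The sketch indexes which windows contain each short token (k-mer), then ranks windows by how many distinct tokens they share with a part of a read.

```python
from collections import Counter, defaultdict


def build_regional_hash_table(reference, k, window, step):
    """Map each k-mer to the set of window indices in which it occurs.

    Windows start every `step` bases and span `window` bases, so they
    overlap when step < window (an assumption of this sketch).
    """
    table = defaultdict(set)
    starts = range(0, max(len(reference) - k + 1, 1), step)
    for w, start in enumerate(starts):
        region = reference[start:start + window]
        for i in range(len(region) - k + 1):
            table[region[i:i + k]].add(w)
    return table


def rank_windows(read_part, table, k):
    """Count distinct k-mers of the read part found in each window.

    Returns (window_index, shared_kmer_count) pairs, best first.
    """
    hits = Counter()
    distinct_kmers = {read_part[i:i + k] for i in range(len(read_part) - k + 1)}
    for kmer in distinct_kmers:
        for w in table.get(kmer, ()):
            hits[w] += 1
    return hits.most_common()


if __name__ == "__main__":
    # Toy reference and parameters, chosen only to keep the example small.
    reference = "ACGTTGCAAGCTTAGGCTAACGTCCGATAGCTTACGGATCCATGCAAT"
    table = build_regional_hash_table(reference, k=4, window=16, step=8)
    read_part = reference[16:32]  # error-free fragment, for simplicity
    for window_index, shared in rank_windows(read_part, table, k=4):
        print(window_index, shared)
```

In this sketch the highest-ranked windows play the role of candidate sites; a real aligner would then pass them to an extension step, such as the sparse dynamic programming heuristic mentioned above.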
rHAT is implemented in C++; the source code is available at https://github.com/HIT-Bioinformatics/rHAT
Contact: ydwang@hit.edu.cn
Supplementary data are available at Bioinformatics online.