rHAT: fast alignment of noisy long reads with regional hashing.

Affiliation

Center for Bioinformatics, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China.

Publication information

Bioinformatics. 2016 Jun 1;32(11):1625-31. doi: 10.1093/bioinformatics/btv662. Epub 2015 Nov 14.

Abstract

MOTIVATION

Single Molecule Real-Time (SMRT) sequencing has been widely applied in cutting-edge genomic studies. However, aligning noisy long SMRT reads to a reference genome with state-of-the-art aligners is still computationally expensive, and this is becoming a bottleneck in applications of SMRT sequencing. Novel approaches are needed to improve the efficiency and effectiveness of SMRT read alignment.

RESULTS

We propose the Regional Hashing-based Alignment Tool (rHAT), a seed-and-extension-based read alignment approach specifically designed for noisy long reads. rHAT indexes the reference genome with a regional hash table (RHT), a hash table-based index that describes the short tokens within local windows of the reference genome. In the seeding phase, rHAT uses the RHT to efficiently count the short token matches between part of the read and local genomic windows, in order to find highly probable candidate sites. In the extension phase, a sparse dynamic programming-based heuristic is used to reduce the cost of aligning the read to the candidate sites. By benchmarking on real and simulated datasets from various prokaryote and eukaryote genomes, we demonstrate that rHAT can effectively align SMRT reads with outstanding throughput.
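
The seeding idea described above can be illustrated with a minimal sketch. This is not rHAT's implementation (which is written in C++); it only shows the general technique of indexing short tokens by the genomic window they fall in and then counting, per window, how many tokens of a read fragment are found there. The function names, the non-overlapping window layout, and the toy values of `k` and `window` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_regional_index(reference, k, window):
    """Map every k-mer to the set of window ids in which it starts.

    Assumption: windows are non-overlapping and a k-mer belongs to the
    window that contains its first base. The real tool may differ.
    """
    index = defaultdict(set)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].add(pos // window)
    return index


def rank_windows(index, read_part, k, top=3):
    """Count k-mer hits of a read fragment per window, best windows first.

    Each k-mer position in the read contributes at most one hit to a
    given window, so repeats inside a window do not inflate its score.
    """
    hits = Counter()
    for pos in range(len(read_part) - k + 1):
        for win in index.get(read_part[pos:pos + k], ()):
            hits[win] += 1
    return hits.most_common(top)


# Toy reference made of three 10-base windows (hypothetical data).
reference = "AAAAAAAAAA" + "ACGTTGCATG" + "CCCCCCCCCC"
index = build_regional_index(reference, k=3, window=10)

# An exact fragment and a fragment with one substituted base both
# point to the middle window, the noisy one with fewer hits.
exact_candidates = rank_windows(index, "CGTTGCA", k=3)
noisy_candidates = rank_windows(index, "CGTAGCA", k=3)
```

Because a sequencing error only destroys the tokens that overlap it, the correct window can still collect more hits than unrelated windows, which is why counting short token matches is a plausible way to shortlist candidate sites for noisy reads before the more expensive extension step.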

AVAILABILITY AND IMPLEMENTATION

rHAT is implemented in C++; the source code is available at https://github.com/HIT-Bioinformatics/rHAT

CONTACT

ydwang@hit.edu.cn

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
