Liao Xiangyu, Liao Xingyu, Zhu Wufei, Fang Lu, Chen Xing
Department of Oncology, The First College of Clinical Medical Science, China Three Gorges University, Yichang Central People's Hospital, Yichang, Hubei 443000, P.R. China.
School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China.
Genet Res (Camb). 2018 Sep 17;100:e8. doi: 10.1017/S0016672318000058.
With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has begun to seriously challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory, disk space and running time consumption. Clustering can also reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data, based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers and forms a unique k-mer set by merging duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing that k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; this merge operation is executed iteratively until no remaining texts satisfy the merge condition. Finally, each long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC can cluster 100 million short reads within 2 hours and that it performs well in reducing memory consumption. HSC is much faster than existing tools and can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced, and the assembler can achieve better assemblies in terms of N50, NA50 and genome fraction.
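The five steps described in the abstract can be illustrated with a minimal Python sketch. It is not the HSC implementation. The function names (`canonical`, `build_kmer_table`, `jaccard`, `cluster_reads`), the use of Jaccard similarity over read-ID sets as the "text similarity", the greedy merge order, and the default values of `k` and `threshold` are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (other characters pass through).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Steps 1-2: map each unique canonical k-mer (key) to the IDs of reads containing it (value)."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Assumed similarity measure: shared read IDs divided by all read IDs."""
    return len(a & b) / len(a | b)


def cluster_reads(reads, k=5, threshold=0.5):
    """Steps 3-5: turn hash entries into texts, then merge similar texts until stable."""
    # Step 3: each hash entry becomes a short "text", here a set of read IDs.
    texts = [set(ids) for ids in build_kmer_table(reads, k).values()]
    merged = True
    while merged:  # Step 4: repeat until no pair satisfies the merge condition.
        merged = False
        result = []
        for text in texts:
            for existing in result:
                if jaccard(text, existing) >= threshold:
                    existing |= text  # Combine into a longer text.
                    merged = True
                    break
            else:
                result.append(text)
        texts = result
    # Step 5: each remaining long text is reported as a cluster of read IDs.
    return texts


if __name__ == "__main__":
    reads = ["AAAACC", "GGTTTT", "CGCGCG"]  # the second read is the reverse complement of the first
    print(cluster_reads(reads, k=4, threshold=0.5))
```

In this toy input the first two reads share every canonical k-mer and so fall into one group, while the third stays separate. Note that in this sketch a read may end up in more than one group when its k-mers are split across texts that never reach the threshold; how HSC resolves such cases is not stated in the abstract.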