ntCard：一种用于基因组数据基数估计的流算法。

ntCard: a streaming algorithm for cardinality estimation in genomics data.

作者信息

Mohamadi Hamid, Khan Hamza, Birol Inanc

机构信息

Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.

Faculty of Science, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada.

出版信息

Bioinformatics. 2017 May 1;33(9):1324-1330. doi: 10.1093/bioinformatics/btw832.

DOI:10.1093/bioinformatics/btw832

PMID:28453674

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC5408799/

Abstract

MOTIVATION

Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k -mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k -mers, or even better, to build a histogram of k -mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k -mer histogram from large volumes of sequencing data is a challenging task.

RESULTS

Here, we present ntCard, a streaming algorithm for estimating the frequencies of k -mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, where the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k -mer coverage frequencies >15× faster than the state-of-the-art algorithms, using similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications.

AVAILABILITY AND IMPLEMENTATION

ntCard is written in C ++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard.

CONTACT

hmohamadi@bcgsc.ca or ibirol@bcgsc.ca.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

摘要

动机

许多生物信息学算法旨在分析某些固定长度的序列，传统上称为k - 元组。这些算法包括德布鲁因图组装方法和序列比对工具。对于这些工具及其下游分析流程而言，一种高效的算法来枚举唯一k - 元组的数量，或者更好地构建k - 元组频率直方图将是很有必要的。在其他应用中，估计的频率可用于预测基因组大小、测量测序错误率以及调整分析工具的运行时参数。然而，从大量测序数据中计算k - 元组直方图是一项具有挑战性的任务。

结果

在此，我们展示了ntCard，一种用于估计基因组数据集k - 元组频率的流式算法。其核心是，ntCard使用ntHash算法为流式序列高效计算哈希值。然后对计算出的哈希值进行采样，以构建描述样本分布的简化表示多重性表。最后，它使用统计模型从样本分布中重建总体分布。我们比较了ntCard和其他基数估计算法的性能。我们使用了三个大小分别为480GB、500GB和2.4TB的数据集，前两个代表人类基因组的全基因组鸟枪法测序实验，最后一个代表白云杉基因组的测序实验。结果表明，ntCard估计k - 元组覆盖频率的速度比现有最先进算法快15倍以上，使用的内存量相似，且准确率更高。因此，我们的基准测试表明ntCard是大规模基因组学应用的一种潜在使能技术。