Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China.
School of Computing and Electronic Information, Guangxi University, Nanning, China.
BMC Genomics. 2018 Dec 31;19(Suppl 10):912. doi: 10.1186/s12864-018-5272-y.
NGS data contain many machine-induced errors. The most advanced error correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely does. An intensively investigated problem is to find a good frequency cutoff f that balances the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance.
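The solid/weak split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy reads, the value of k, and the convention that a k-mer is solid when its count is at least f are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_cutoff(counts, f):
    """Split k-mers into solid (count >= f) and weak (count < f).

    The ">= f" convention is an assumption; the abstract does not
    state whether the cutoff is inclusive.
    """
    solid = {kmer for kmer, c in counts.items() if c >= f}
    weak = set(counts) - solid
    return solid, weak


# Hypothetical toy input; real NGS data sets hold millions of reads.
reads = ["ACGTACGT", "ACGTACGA", "CGTACGTT"]
counts = count_kmers(reads, k=4)
solid, weak = split_by_cutoff(counts, f=2)
```

With these toy reads, the k-mers that appear only once (ACGA and CGTT) fall into the weak set, while the repeated ones are solid. A fixed cutoff alone cannot tell whether a low-count k-mer is an error or a genuinely rare sequence, which is the gap the two refinement steps (i) and (ii) aim to close.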
We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model the frequencies of correct k-mers, and we combine the two models to determine f. To identify the two special subsets of k-mers, we use the z-score of each k-mer, which measures the number of standard deviations by which the k-mer's frequency differs from the mean. These statistically solid k-mers are then used to construct a Bloom filter for error correction. In tests on both real and synthetic NGS data sets, our method markedly outperforms state-of-the-art methods.
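As a rough illustration of the z-score and Bloom filter steps, the sketch below computes a z-score against a reference group of frequencies and stores trusted k-mers in a small Bloom filter. The abstract does not specify which frequencies the mean and standard deviation are taken over, nor the filter's hash functions or size, so the reference group, the SHA-256 based hashing, and all names and parameters here are assumptions rather than the authors' design.

```python
import hashlib
from statistics import mean, pstdev


def z_score(freq, reference_freqs):
    """Number of standard deviations that freq lies from the reference mean."""
    mu = mean(reference_freqs)
    sigma = pstdev(reference_freqs)  # population standard deviation
    if sigma == 0:
        return 0.0  # assumption: identical reference values mean no deviation
    return (freq - mu) / sigma


class BloomFilter:
    """Minimal Bloom filter: no false negatives, possible false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical usage: reference frequencies 8 and 12 give mean 10, std 2.
z = z_score(14, [8, 12])  # (14 - 10) / 2

trusted = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in ["ACGT", "CGTA"]:
    trusted.add(kmer)
```

A Bloom filter suits this role because membership queries for k-mers use a fixed amount of memory regardless of how many k-mers are stored, at the cost of occasional false positives.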
The z-score is adequate for distinguishing solid k-mers from weak k-mers and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.