Mining statistically-solid k-mers for accurate NGS error correction.

Affiliations

Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China.

School of Computing and Electronic Information, Guangxi University, Nanning, China.

Publication information

BMC Genomics. 2018 Dec 31;19(Suppl 10):912. doi: 10.1186/s12864-018-5272-y.

Abstract

BACKGROUND

NGS data contain many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, whereas a weak k-mer most likely does. An intensively investigated problem is finding a good frequency cutoff f to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance.
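
The solid/weak split described above can be illustrated with a minimal sketch. It assumes that a k-mer is solid when its count is at least f (the abstract does not say whether the comparison is strict), and the reads, k, and f used here are toy values chosen for illustration only.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def split_by_cutoff(counts, f):
    """Partition k-mers into solid (count >= f, an assumed convention) and weak."""
    solid = {kmer for kmer, c in counts.items() if c >= f}
    weak = set(counts) - solid
    return solid, weak

# Toy reads; the last base of the second and third reads differs from the
# repeating ACGT pattern, so the k-mers covering those bases are rare.
reads = ["ACGTACGT", "ACGTACGA", "CGTACGTT"]
counts = count_kmers(reads, k=4)
solid, weak = split_by_cutoff(counts, f=2)
print(sorted(solid))  # k-mers seen at least twice
print(sorted(weak))   # k-mers seen once
```

A plain cutoff like this misclassifies some k-mers in both directions, which is the gap the two refinement steps (i) and (ii) are meant to close.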

RESULTS

We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model the frequencies of correct k-mers, and we combine the two models to determine f. To identify the two special subsets of k-mers, we use the z-score of each k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to state-of-the-art methods.
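
To make the Bloom filter step concrete, here is a minimal sketch of a Bloom filter loaded with solid k-mers. It is not the authors' implementation: the class name `BloomFilter`, the bit-array size, the number of hash functions, and the double-hashing scheme built on SHA-256 are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit halves
        # of a single SHA-256 digest (an illustrative choice of hash).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Hypothetical set of statistically-solid k-mers.
solid_kmers = ["ACGT", "CGTA", "GTAC", "TACG"]
bf = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in solid_kmers:
    bf.add(kmer)

print(all(kmer in bf for kmer in solid_kmers))  # inserted k-mers are always found
```

A corrector can then test candidate k-mers against the filter in constant time and memory per query, at the cost of occasionally accepting a k-mer that was never inserted.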

CONCLUSION

The z-score is adequate for distinguishing solid k-mers from weak k-mers, and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.
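
The following sketch shows how a z-score could rescue a low-frequency k-mer and demote a suspicious high-frequency one. The abstract does not state which frequencies the mean and standard deviation are taken over, so the `reference_of` hook, the thresholds `z_demote` and `z_promote`, and all numbers below are hypothetical.

```python
from statistics import mean, pstdev

def z_score(freq, reference_freqs):
    """Number of standard deviations `freq` lies from the mean of the reference."""
    mu = mean(reference_freqs)
    sigma = pstdev(reference_freqs)
    if sigma == 0:
        return 0.0  # no spread in the reference: treat as unremarkable
    return (freq - mu) / sigma

def refine(counts, f, reference_of, z_demote, z_promote):
    """Adjust a cutoff-based solid set using z-scores.

    reference_of(kmer) returns the frequencies to compare against; what that
    reference should be is an assumption left to the caller.
    """
    solid = set()
    for kmer, c in counts.items():
        z = z_score(c, reference_of(kmer))
        if c >= f and z > z_demote:
            solid.add(kmer)   # above the cutoff and not suspiciously low
        elif c < f and z >= z_promote:
            solid.add(kmer)   # below the cutoff but unusually high locally
    return solid

# Toy data: AAAT passes the cutoff but is far below its high-coverage
# reference; CCCC fails the cutoff but stands out in a low-coverage region.
counts = {"AAAA": 50, "AAAT": 6, "CCCC": 3}
reference = {"AAAA": [48, 50, 52], "AAAT": [48, 50, 52], "CCCC": [1, 1, 2]}
refined = refine(counts, f=5, reference_of=reference.__getitem__,
                 z_demote=-3.0, z_promote=3.0)
print(sorted(refined))
```

With a cutoff alone, AAAT would be kept and CCCC discarded; the z-score reverses both decisions in this toy case.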

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dfe0/6311904/467f778f32e4/12864_2018_5272_Fig1_HTML.jpg
