Data Science Institute, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia.
School of Biomedical Engineering, Faculty of Engineering and IT, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia.
Nucleic Acids Res. 2021 Oct 11;49(18):e106. doi: 10.1093/nar/gkab610.
Raw miRNA sequencing reads contain substitution errors introduced by the sequencing machine, and sometimes insertions and deletions (indels). Although the error rate can be as low as 0.1%, precise correction of these errors is critically important, because single-base-resolution analyses of isoform variation, such as novel isomiR discovery, characterization of editing events, differential expression analysis, and tissue-specific isoform identification, are very sensitive to the base positions and copy counts of the reads. Existing error correction methods do not work for miRNA sequencing data, because miRNA reads differ from DNA or mRNA sequencing reads in length and per-read coverage. We present a novel lattice structure combining k-mers, (k - 1)-mers and (k + 1)-mers to address this problem. The method is particularly effective at correcting indel errors. Extensive tests on datasets with known ground-truth errors demonstrate that the method removes almost all of the errors without introducing any new ones, improving data quality from one error in every 50 reads to one error in every 1300 reads. Studies on experimental miRNA sequencing datasets show that the errors are often corrected at the 5' ends and in the seed regions of the reads, and that correction leads to remarkable changes in miRNA isoform abundance, the volume of singleton reads, overall entropy, isomiR families, tissue-specific miRNAs, and rare-miRNA quantities.
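To illustrate why keeping (k - 1)-mers and (k + 1)-mers alongside k-mers helps with indels, the following is a minimal sketch only, not the authors' implementation. It counts substrings at the three lengths and, for a given k-mer, looks up variants that differ by one substituted, one removed, or one added base. The function names (`count_mers`, `build_levels`, `linked_variants`), the toy reads, and the choice of k are all hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def count_mers(reads, size):
    """Count every substring of the given length across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - size + 1):
            counts[read[i:i + size]] += 1
    return counts

def build_levels(reads, k):
    """Hypothetical three-level table: (k-1)-mers, k-mers and (k+1)-mers."""
    return {size: count_mers(reads, size) for size in (k - 1, k, k + 1)}

def linked_variants(kmer, levels):
    """Find observed strings one edit away from kmer, grouped by level."""
    k = len(kmer)
    found = {"same_length": set(), "shorter": set(), "longer": set()}
    for i in range(k):
        # One substituted base: stays on the k-mer level.
        for base in BASES:
            if base != kmer[i]:
                candidate = kmer[:i] + base + kmer[i + 1:]
                if candidate in levels[k]:
                    found["same_length"].add(candidate)
        # One removed base: lands on the (k-1)-mer level.
        candidate = kmer[:i] + kmer[i + 1:]
        if candidate in levels[k - 1]:
            found["shorter"].add(candidate)
    for i in range(k + 1):
        # One added base: lands on the (k+1)-mer level.
        for base in BASES:
            candidate = kmer[:i] + base + kmer[i:]
            if candidate in levels[k + 1]:
                found["longer"].add(candidate)
    return found

# Toy data (assumed): five correct reads and one read that lost a 'T'.
reads = ["ACGTAC"] * 5 + ["ACGAC"]
levels = build_levels(reads, k=5)
variants = linked_variants("ACGAC", levels)
print(levels[5]["ACGAC"], variants["longer"])
```

In this toy case the rare 5-mer `ACGAC` has no substitution neighbour among the 5-mers, but adding one base links it to the frequent 6-mer `ACGTAC`. A k-mer-only table cannot express that kind of link, which is the intuition behind combining the three lengths; how the published method scores and applies such links is not described in the abstract.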