Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing.

Affiliation

Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland, United States of America.

Publication information

PLoS One. 2012;7(7):e41356. doi: 10.1371/journal.pone.0041356. Epub 2012 Jul 31.

Abstract

While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate the association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being "recalibrated" (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
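To make the Phred-scaled figures in the abstract concrete, the following is a minimal sketch of how an empirical quality score could be estimated per dinucleotide context from reads mapped to spike-in standards, where any mismatch against the known sequence can be treated as a sequencing error. It is an illustration only, not GATK's implementation: the function names, the input tuple layout, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality Q to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability p to a Phred-scaled quality: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def empirical_quality(mismatches, total):
    """Empirical Phred quality from mismatch counts.

    Add-one smoothing (an assumption of this sketch) avoids log10(0)
    when a context has no observed mismatches.
    """
    return error_prob_to_phred((mismatches + 1) / (total + 2))


def tally_by_dinucleotide(observations):
    """Group base calls by dinucleotide context (previous base + current base).

    `observations` is a hypothetical iterable of tuples
    (previous_base, current_base, reported_quality, is_mismatch),
    taken from reads aligned to spike-in sequences of known composition.
    Returns {context: [mismatches, total, sum of reported error probabilities]}.
    """
    counts = defaultdict(lambda: [0, 0, 0.0])
    for prev_base, cur_base, reported_q, is_mismatch in observations:
        entry = counts[prev_base + cur_base]
        entry[0] += int(is_mismatch)
        entry[1] += 1
        entry[2] += phred_to_error_prob(reported_q)
    return counts


def quality_shift(counts):
    """Empirical minus reported quality per context, in Phred units.

    A negative value means the sequencer overestimated quality in that context.
    """
    shifts = {}
    for context, (mismatches, total, sum_p) in counts.items():
        reported = error_prob_to_phred(sum_p / total)
        shifts[context] = empirical_quality(mismatches, total) - reported
    return shifts


if __name__ == "__main__":
    # Toy data: 998 "GG" base calls reported at Q30, 9 of which are mismatches.
    toy = [("G", "G", 30, i < 9) for i in range(998)]
    print(quality_shift(tally_by_dinucleotide(toy)))
```

With this toy input the shift for GG should come out at roughly -10 Phred units, meaning the reported Q30 overstates an empirical quality near Q20. For scale, a difference of 5 Phred units corresponds to a factor of about 3.2 in error probability, and 13 units to a factor of about 20.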

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3e11/3409179/4c55ffc1103b/pone.0041356.g001.jpg
