Li Xing, Nair Asha, Wang Shengqin, Wang Liguo
Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA.
Methods Mol Biol. 2015;1269:137-46. doi: 10.1007/978-1-4939-2291-8_8.
Direct sequencing of complementary DNA (cDNA) using high-throughput sequencing technologies (RNA-seq) is widely used and allows a more comprehensive understanding of the transcriptome than microarrays. In theory, RNA-seq should be able to precisely identify and quantify all RNA species, small or large, at low or high abundance. However, RNA-seq is a complicated, multistep process involving reverse transcription, amplification, fragmentation, purification, adaptor ligation, and sequencing. Improper operation at any of these steps can produce biased or even unusable data. Additionally, biases intrinsic to RNA-seq (such as GC bias and nucleotide composition bias) and transcriptome complexity can also degrade data quality. Therefore, comprehensive quality assessment is the first and most critical step for all downstream analyses and interpretation of results. This chapter discusses the most widely used quality control metrics, including sequence quality, sequencing depth, read duplication rate (clonal reads), alignment quality, nucleotide composition bias, PCR bias, GC bias, rRNA and mitochondrial contamination, and coverage uniformity.
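To make a few of the metrics listed above concrete, the following is a minimal sketch, not the chapter's own method, of how mean sequence quality, GC content, and a sequence-level duplication rate might be computed from raw reads. It assumes four-line FASTQ records with Phred+33 quality encoding; the function names and the toy reads are hypothetical, and duplication is approximated here as the fraction of reads whose sequence exactly matches an earlier read, which dedicated tools may define differently (for example, from alignment positions).

```python
from collections import Counter


def parse_fastq(text):
    """Yield (sequence, quality_string) pairs from four-line FASTQ text."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]


def mean_phred(quality, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def gc_fraction(seq):
    """Fraction of G and C bases in one read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def duplication_rate(seqs):
    """Fraction of reads that repeat an already-seen sequence (assumed definition)."""
    return 1 - len(Counter(seqs)) / len(seqs)


def summarize(fastq_text):
    """Collect per-sample summary metrics from FASTQ text."""
    reads = list(parse_fastq(fastq_text))
    seqs = [s for s, _ in reads]
    return {
        "n_reads": len(reads),
        "mean_quality": sum(mean_phred(q) for _, q in reads) / len(reads),
        "gc_fraction": sum(gc_fraction(s) for s in seqs) / len(seqs),
        "duplication_rate": duplication_rate(seqs),
    }


# Toy input (hypothetical reads, for illustration only).
EXAMPLE = (
    "@r1\nGGCC\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nAATT\n+\n5555\n"
    "@r4\nACGT\n+\n5555\n"
)

if __name__ == "__main__":
    print(summarize(EXAMPLE))
```

In practice, alignment-based metrics such as coverage uniformity and rRNA or mitochondrial contamination require mapped reads and an annotation, so they fall outside what a read-level sketch like this can show.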