UC Santa Cruz, Molecular, Cell and Developmental Biology, 1156 High Street, Santa Cruz, CA 95064, USA.
UC Santa Cruz, Genomics Institute, 1156 High Street, Santa Cruz, CA 95064, USA.
Gigascience. 2021 Mar 13;10(3). doi: 10.1093/gigascience/giab011.
The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) depends on sequencing depth. Unmapped and non-exonic reads do not contribute to gene expression quantification; duplicate reads do contribute to quantification but are not informative for reproducibility. We show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis.
In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1-77% of all reads (median [IQR], 3% [3-6%]); duplicate reads constitute 3-100% of mapped reads (median [IQR], 27% [13-43%]); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (median [IQR], 25% [16-37%]). MEND reads constitute 0-79% of total reads (median [IQR], 50% [30-61%]).
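The percentages above use nested denominators: unmapped reads are a fraction of all reads, duplicates a fraction of mapped reads, and non-exonic reads a fraction of mapped, non-duplicate reads. The following Python sketch illustrates that arithmetic only; it is not the authors' script, and the `ReadCounts` class, the `mend_summary` function, and the example counts are hypothetical names and values introduced here for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadCounts:
    """Hypothetical container for per-sample read counts (not from the paper)."""
    total: int          # all sequenced reads
    mapped: int         # reads that mapped to the reference
    mapped_nondup: int  # mapped reads left after removing duplicates
    mend: int           # mapped, exonic, non-duplicate reads


def _fraction(numerator: int, denominator: int) -> float:
    # Return NaN when the denominator is zero (the fraction is undefined).
    return numerator / denominator if denominator else math.nan


def mend_summary(c: ReadCounts) -> dict:
    """Compute each stage's loss relative to the previous stage's survivors."""
    if not (c.total >= c.mapped >= c.mapped_nondup >= c.mend >= 0):
        raise ValueError("counts must be nested: total >= mapped >= mapped_nondup >= mend >= 0")
    return {
        "unmapped_of_total": _fraction(c.total - c.mapped, c.total),
        "duplicate_of_mapped": _fraction(c.mapped - c.mapped_nondup, c.mapped),
        "nonexonic_of_mapped_nondup": _fraction(c.mapped_nondup - c.mend, c.mapped_nondup),
        "mend_of_total": _fraction(c.mend, c.total),
    }


# Hypothetical sample whose stage-wise losses equal the reported medians
# (3% unmapped, 27% duplicate, 25% non-exonic).
example = ReadCounts(total=1_000_000, mapped=970_000, mapped_nondup=708_100, mend=531_075)
summary = mend_summary(example)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

In this made-up sample, about 53% of total reads are MEND reads. That figure need not match the reported median of 50%, because the median of a product across samples is generally not the product of the per-stage medians.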
Because not all reads in an RNA-Seq dataset are informative for reproducibility of gene expression measurements and the fraction of reads that are informative varies, we propose reporting a dataset's sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than total, mapped, or exonic reads. We provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. We recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.