Key Laboratory of Biomedical Engineering & Technology of Shandong High School, Qilu Medical University, Zibo, P. R. China.
Xuzhou Medical University, Xuzhou, P. R. China.
Ann Hum Genet. 2021 Nov;85(6):235-244. doi: 10.1111/ahg.12441. Epub 2021 Aug 3.
Great efforts have been made on algorithms for RNA-seq data to enhance the accuracy and efficiency of differential expression (DE) analysis. However, no consensus has been reached on the proper threshold values of fold change and adjusted p-value for filtering differentially expressed genes (DEGs). It is generally believed that the more stringent the filtering threshold, the more reliable the result of a DE analysis. Nevertheless, by analyzing the impact of both adjusted p-value and fold change thresholds on DE analyses, with RNA-seq data obtained for three different cancer types from The Cancer Genome Atlas (TCGA) database, we found that, for a given sample size, the reproducibility of DE results became poorer when more stringent thresholds were applied. No matter which threshold level was applied, the overlap rates of DEGs were generally lower for small sample sizes than for large sample sizes. The raw read count analysis demonstrated that the transcript expression of the same gene varied widely across samples, whether in tumor groups or in normal groups, which resulted in drastic fluctuations in fold change values and adjusted p-values when different sets of samples were used. Overall, more stringent thresholds did not yield more reliable DEGs because of the high variation in transcript expression; the reliability of DEGs obtained with small sample sizes was more susceptible to this variation. Therefore, less stringent thresholds are recommended for screening DEGs. Moreover, large sample sizes should be considered in RNA-seq experimental designs to reduce the interfering effect of variation in transcript expression on DEG identification.
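The threshold-and-overlap comparison described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not define how the overlap rate was computed, so the Jaccard index (shared genes divided by the union) is assumed here, and the gene names, fold changes, adjusted p-values, and threshold pairs are invented toy values, not data or results from the study.

```python
from typing import Dict, Set, Tuple

# Maps gene -> (log2 fold change, adjusted p-value).
DEResult = Dict[str, Tuple[float, float]]


def filter_degs(result: DEResult, min_abs_log2fc: float, max_padj: float) -> Set[str]:
    """Return the genes that pass both the fold change and adjusted p-value thresholds."""
    return {
        gene
        for gene, (log2fc, padj) in result.items()
        if abs(log2fc) >= min_abs_log2fc and padj <= max_padj
    }


def overlap_rate(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two DEG sets (an assumed definition of 'overlap rate')."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical DE results for the same genes from two different sample subsets.
# The values fluctuate between runs, mimicking variation in transcript expression.
run_a: DEResult = {"G1": (3.2, 1e-6), "G2": (1.4, 0.003), "G3": (2.1, 0.0005), "G4": (0.4, 0.4)}
run_b: DEResult = {"G1": (2.9, 1e-5), "G2": (1.1, 0.02), "G3": (1.8, 0.004), "G4": (0.6, 0.2)}

# Illustrative threshold pairs: (|log2FC| minimum, adjusted p-value maximum).
lenient = (1.0, 0.05)
strict = (2.0, 0.001)

lenient_overlap = overlap_rate(filter_degs(run_a, *lenient), filter_degs(run_b, *lenient))
strict_overlap = overlap_rate(filter_degs(run_a, *strict), filter_degs(run_b, *strict))

print(f"lenient thresholds: overlap = {lenient_overlap:.2f}")
print(f"strict thresholds:  overlap = {strict_overlap:.2f}")
```

In this toy setup, genes whose values sit near a strict cutoff pass in one run and fail in the other, so the stricter pair gives the lower overlap. That is the mechanism the abstract describes, shown on made-up numbers rather than on the TCGA data.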