Brief Bioinform. 2018 Jan 1;19(1):109-117. doi: 10.1093/bib/bbw103.
Individual sample heterogeneity is one of the biggest obstacles to biomarker identification for complex diseases such as cancers. Current statistical models for identifying differentially expressed genes between disease and control groups often overlook the substantial heterogeneity among human samples. Traditional nonparametric tests, although distribution free and robust to heterogeneity, discard detailed information in the data and sacrifice statistical power. Here, we propose an empirical likelihood ratio test with a mean-variance relationship constraint (ELTSeq) for the differential expression analysis of RNA sequencing (RNA-seq) data. As a distribution-free nonparametric model, ELTSeq handles individual heterogeneity by estimating an empirical probability for each observation without making any assumption about the read-count distribution. It also incorporates a constraint for read-count overdispersion, which is widely observed in RNA-seq data. ELTSeq demonstrates a significant improvement over existing methods such as edgeR, DESeq, t-tests, Wilcoxon tests and the classic empirical likelihood ratio test when handling heterogeneous groups. It will significantly advance transcriptomic studies of cancers and other complex diseases.
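To make the idea of "estimating an empirical probability for each observation" concrete, the following is a minimal sketch of the classic two-sample empirical likelihood ratio test for equal means, which the abstract lists as a baseline. It is not ELTSeq itself: it omits the mean-variance (overdispersion) constraint, and the function names `neg2_log_el_ratio` and `el_two_sample_test` are hypothetical, chosen for illustration only. The p-value relies on the usual asymptotic chi-square approximation with 1 degree of freedom, which can be rough for small samples.

```python
import math

def neg2_log_el_ratio(x, mu, iters=100):
    """-2 log empirical likelihood ratio for H0: E[x] = mu (one sample)."""
    d = [xi - mu for xi in x]
    d_min, d_max = min(d), max(d)
    if d_min >= 0 or d_max <= 0:
        return math.inf  # mu is not strictly inside the range of the data
    # The optimal weights are p_i = 1 / (n * (1 + lam * d_i)), where lam
    # solves sum(d_i / (1 + lam * d_i)) = 0. The left-hand side is strictly
    # decreasing in lam, so bisection on the feasible interval finds the root.
    lo, hi = -1.0 / d_max, -1.0 / d_min
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if sum(di / (1.0 + lam * di) for di in d) > 0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    # Clamp at zero to absorb tiny negative values from rounding.
    return max(0.0, 2.0 * sum(math.log1p(lam * di) for di in d))

def el_two_sample_test(x, y, iters=80):
    """Test H0: E[x] = E[y] by profiling out the common mean mu.

    Returns (statistic, p_value), with the p-value taken from the
    asymptotic chi-square distribution with 1 degree of freedom.
    """
    # A common mean must lie strictly inside the range of both samples.
    a = max(min(x), min(y))
    b = min(max(x), max(y))
    if a >= b:
        return math.inf, 0.0

    def profile(mu):
        return neg2_log_el_ratio(x, mu) + neg2_log_el_ratio(y, mu)

    # The profile is convex in mu, so ternary search locates its minimum.
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if profile(m1) < profile(m2):
            b = m2
        else:
            a = m1
    stat = profile(0.5 * (a + b))
    # Chi-square(1) survival function: P(X > s) = erfc(sqrt(s / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

if __name__ == "__main__":
    # Made-up numbers standing in for one gene's normalized counts.
    control = [12.0, 15.0, 9.0, 14.0, 11.0, 40.0]
    disease = [22.0, 30.0, 18.0, 35.0, 27.0, 25.0]
    stat, p = el_two_sample_test(control, disease)
    print(f"-2 log R = {stat:.3f}, p = {p:.4f}")
```

In a differential expression setting, a test of this kind would be applied gene by gene to normalized read counts, followed by multiple-testing correction. Only the Python standard library is used here to keep the sketch self-contained.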