Agniel Denis, Hejblum Boris P
Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck St, Boston, MA 02115, USA.
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; University of Bordeaux, ISPED, INSERM U1219, INRIA SISTM, 146 rue Léo Saignat, 33076 Bordeaux, France; Vaccine Research Institute, Créteil, France.
Biostatistics. 2017 Oct 1;18(4):589-604. doi: 10.1093/biostatistics/kxx005.
As gene expression measurement technology shifts from microarrays to sequencing, the statistical tools available for analyzing expression data must be adapted, since RNA-seq data are measured as counts. It has been proposed to model RNA-seq counts as continuous variables using nonparametric regression to account for their inherent heteroscedasticity. In this vein, we propose tcgsaseq, a principled, model-free, and efficient method for detecting longitudinal changes in RNA-seq gene sets defined a priori. The method identifies those gene sets whose expression varies over time, based on an original variance component score test that accounts for both covariates and heteroscedasticity without assuming any specific parametric distribution for the (transformed) counts. We demonstrate that, despite the presence of a nonparametric component, our test statistic has a simple form and limiting distribution, both of which can be computed quickly. A permutation version of the test is also proposed for very small sample sizes. Applied to both simulated data and two real datasets, tcgsaseq exhibits very good statistical properties, with greater stability and power than the state-of-the-art methods ROAST (rotation gene set testing), edgeR, and DESeq2, which can fail to control the type I error under certain realistic settings. We have made the method available to the community in the R package tcgsaseq.
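To make the idea of a permutation version of a gene-set score test concrete, the following is a minimal Python sketch. It is not the tcgsaseq implementation and does not reproduce the authors' test statistic. It assumes independent samples, an intercept-only null model, and precision weights supplied by the caller (here set to 1 for illustration, whereas the abstract describes estimating heteroscedasticity nonparametrically). The function names `gene_set_score_stat` and `permutation_pvalue` are hypothetical.

```python
import numpy as np


def gene_set_score_stat(y, time, weights):
    """Toy gene-set statistic: sum over genes of squared weighted scores for time.

    y       : (n_samples, n_genes) transformed expression values
    time    : (n_samples,) measurement times
    weights : (n_samples, n_genes) precision weights (assumed given, not estimated here)
    """
    t_centered = time - time.mean()
    # Residuals under an intercept-only null model (weighted mean per gene).
    y_mean = (weights * y).sum(axis=0) / weights.sum(axis=0)
    resid = y - y_mean
    # One score per gene: weighted cross-product of residuals with centered time.
    scores = (weights * resid * t_centered[:, None]).sum(axis=0) / np.sqrt(len(time))
    return float((scores ** 2).sum())


def permutation_pvalue(y, time, weights, n_perm=999, seed=0):
    """Permutation p-value obtained by shuffling time labels across samples.

    Simplifying assumption: samples are exchangeable under the null, which
    ignores within-subject correlation in truly longitudinal designs.
    """
    rng = np.random.default_rng(seed)
    observed = gene_set_score_stat(y, time, weights)
    exceed = 0
    for _ in range(n_perm):
        permuted_time = rng.permutation(time)
        if gene_set_score_stat(y, permuted_time, weights) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Illustrative use on simulated data: 40 samples, 5 genes, 4 time points.
rng = np.random.default_rng(42)
time = np.tile(np.arange(4.0), 10)
y = 0.5 * time[:, None] + rng.normal(size=(40, 5))  # genes that drift with time
weights = np.ones_like(y)                           # placeholder unit weights
print(permutation_pvalue(y, time, weights, n_perm=499))
```

The add-one correction in the p-value is a common convention for permutation tests with few samples. Replacing the unit weights with inverse-variance estimates would be the natural next step toward handling heteroscedasticity.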