Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, 37232, USA.
Department of Statistics, National Cheng Kung University, Tainan, 70101, Taiwan.
BMC Bioinformatics. 2018 May 30;19(1):191. doi: 10.1186/s12859-018-2191-5.
Sample size estimation is one of the most important, and most often neglected, components of a successful RNA sequencing (RNA-Seq) experiment. A few methods based on the negative binomial model have been developed to estimate sample size from the parameters of a single gene. However, RNA-Seq experiments quantify thousands of genes and test them for differential expression simultaneously. Additional issues must therefore be addressed, including control of the false discovery rate across multiple statistical tests and the wide range of read counts and dispersions across genes.
To address these issues, we developed RnaSeqSampleSize, a sample size and power estimation method based on the distributions of average read counts and dispersions per gene, estimated from real RNA-Seq data. Datasets from previous, similar experiments, such as those in The Cancer Genome Atlas (TCGA), can serve as a reference. Read counts and dispersions are estimated from the reference distribution, and power and sample size are then estimated and summarized from that information. RnaSeqSampleSize is implemented in R and can be installed from Bioconductor. A user-friendly graphical web interface is available at http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/ .
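The idea of averaging power over a reference distribution of read counts and dispersions, rather than relying on a single gene, can be illustrated with a minimal Python sketch. This is not the RnaSeqSampleSize implementation or its API: the function names, the reference values, and the use of a normal (Wald) approximation for the log fold change under a negative binomial model are all assumptions made for illustration. The significance level `alpha` is assumed to be a per-gene level already adjusted for the target false discovery rate.

```python
from math import log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gene_power(n, mean_count, dispersion, fold_change, alpha):
    """Approximate two-sided power for one gene with n samples per group.

    Assumption: a Wald test on the log fold change, using the delta-method
    variance of the log sample mean under a negative binomial model,
    Var(log mean) ~ (1/mu + phi) / n. This is an illustrative approximation,
    not the test used by RnaSeqSampleSize.
    """
    var_control = (1.0 / mean_count + dispersion) / n
    var_treated = (1.0 / (mean_count * fold_change) + dispersion) / n
    standard_error = sqrt(var_control + var_treated)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    effect = abs(log(fold_change)) / standard_error
    return _STD_NORMAL.cdf(effect - z_crit) + _STD_NORMAL.cdf(-effect - z_crit)


def average_power(n, genes, fold_change, alpha):
    """Mean power over (mean_count, dispersion) pairs from a reference dataset."""
    powers = [gene_power(n, mu, phi, fold_change, alpha) for mu, phi in genes]
    return sum(powers) / len(powers)


def smallest_sample_size(genes, fold_change, alpha, target_power, max_n=1000):
    """Smallest per-group n whose average power reaches target_power, else None."""
    for n in range(2, max_n + 1):
        if average_power(n, genes, fold_change, alpha) >= target_power:
            return n
    return None


# Hypothetical (mean read count, dispersion) pairs standing in for values
# estimated from a reference dataset; they are not taken from TCGA.
reference_genes = [(5.0, 0.5), (50.0, 0.2), (500.0, 0.1), (2000.0, 0.05)]

if __name__ == "__main__":
    n = smallest_sample_size(reference_genes, fold_change=2.0,
                             alpha=0.001, target_power=0.8)
    print("samples per group:", n)
    print("average power:", average_power(n, reference_genes, 2.0, 0.001))
```

In this sketch, genes with low counts and high dispersion pull the average power down, which is why a sample size derived from one "typical" gene can misstate the power of the whole experiment.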
RnaSeqSampleSize provides a convenient and powerful way to estimate power and sample size for an RNA-Seq experiment. It also offers several unique features, including estimation for genes or pathways of interest, power curve visualization, and parameter optimization.