Division of Personalized Nutrition and Medicine, National Center for Toxicological Research, FDA, Jefferson, AR 72079, USA.
BMC Bioinformatics. 2010 Jan 25;11:48. doi: 10.1186/1471-2105-11-48.
Before conducting a microarray experiment, one important question is how many arrays are required to have adequate power to identify differentially expressed genes. This paper discusses some crucial issues in the problem formulation, parameter specifications, and approaches that are commonly proposed for sample size estimation in microarray experiments. Common methods formulate the problem as the minimum sample size necessary to achieve, on average, a specified sensitivity (the proportion of truly differentially expressed genes that are detected) at a specified false discovery rate (FDR) level and a specified expected proportion (pi1) of truly differentially expressed genes in the array. Unfortunately, the probability of achieving the specified sensitivity under such a formulation can be low. We formulate the sample size problem as the number of arrays needed to achieve a specified sensitivity with 95% probability at the specified significance level. A permutation method that uses a small pilot dataset to estimate the sample size is proposed. This method accounts for correlation and effect size heterogeneity among genes.
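The two formulations contrasted above can be written side by side. The notation here is an assumption introduced only for illustration and does not come from the paper: S_n denotes the (random) sensitivity obtained with n arrays per group, and \lambda denotes the specified target sensitivity.

```latex
% Common formulation: the target sensitivity is met on average
n_{\text{avg}} = \min\{\, n : \mathbb{E}[S_n] \ge \lambda \,\}

% Proposed formulation: the target sensitivity is met with 95% probability
n_{95} = \min\{\, n : \Pr(S_n \ge \lambda) \ge 0.95 \,\}
```

Because S_n varies from experiment to experiment, meeting the first criterion says nothing directly about how often a single experiment reaches \lambda; the second criterion constrains that probability explicitly.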
A sample size estimate based on the common formulation, which aims to achieve the desired sensitivity on average, can be calculated using a univariate method without taking the correlation among genes into consideration. This formulation of the sample size problem is inadequate because the probability of achieving the specified sensitivity can be lower than 50%. In contrast, the sample size calculated by the proposed permutation method ensures that at least the desired sensitivity is achieved with 95% probability. The method is shown to perform well on a real example dataset using a small pilot dataset with 4-6 samples per group.
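The gap between "on average" and "with 95% probability" can be illustrated with a small simulation. The sketch below is not the authors' permutation method and uses no pilot data. It assumes, purely for illustration, equicorrelated two-sample z-statistics, a common standardized effect size for every differentially expressed gene, and a fixed per-gene critical value. All function names and parameter values (`simulate_sensitivity`, `smallest_n`, `rho`, `z_crit`, and so on) are hypothetical.

```python
import numpy as np

def simulate_sensitivity(n_per_group, n_de=100, effect=1.0, rho=0.5,
                         z_crit=3.0, n_sim=2000, seed=0):
    """Simulate the sensitivity of n_sim hypothetical experiments.

    Assumptions (illustrative only): every truly differentially expressed
    gene has the same standardized effect, variances are known, and the
    per-gene z-statistics share a common pairwise correlation rho.
    """
    rng = np.random.default_rng(seed)
    # Mean of a two-sample z-statistic with n_per_group arrays per group.
    shift = effect * np.sqrt(n_per_group / 2.0)
    # Equicorrelated noise: one shared factor plus gene-specific noise.
    shared = rng.standard_normal((n_sim, 1))
    own = rng.standard_normal((n_sim, n_de))
    z = shift + np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own
    # Sensitivity = fraction of truly DE genes declared significant.
    return (np.abs(z) > z_crit).mean(axis=1)

def smallest_n(target, criterion, n_max=200, **kwargs):
    """Smallest n per group whose simulated sensitivities meet the criterion."""
    for n in range(2, n_max + 1):
        if criterion(simulate_sensitivity(n, **kwargs)) >= target:
            return n
    return None

target = 0.8
# Common formulation: mean sensitivity reaches the target.
n_mean = smallest_n(target, np.mean)
# Alternative formulation: 5th percentile of sensitivity reaches the target,
# i.e. roughly 95% of simulated experiments achieve it.
n_prob = smallest_n(target, lambda s: np.quantile(s, 0.05))
# How often the target is actually reached at the "average" sample size.
prob_at_n_mean = float((simulate_sensitivity(n_mean) >= target).mean())

print("n per group (mean criterion):", n_mean)
print("n per group (95% probability criterion):", n_prob)
print("P(sensitivity >= target) at the mean-criterion n:", prob_at_n_mean)
```

In this toy setting, the stronger the assumed correlation `rho`, the wider the spread of sensitivity across experiments, and the larger the difference between the two sample sizes. Setting `rho=0.0` brings them close together, which is consistent with the point that a univariate calculation suffices only for the average criterion.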
We recommend that the sample size problem be formulated as detecting a specified proportion of differentially expressed genes with 95% probability. This formulation ensures that the desired proportion of true positives is found with high probability. The proposed permutation method takes the correlation structure and effect size heterogeneity into consideration and works well using only a small pilot dataset.