Liu Xinyu, Wang Yupeng, Sriram T N
Department of Statistics, University of Georgia, Athens, GA 30602, USA.
BMC Bioinformatics. 2014 Jun 14;15:190. doi: 10.1186/1471-2105-15-190.
Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual's class membership to his/her risk of developing a disease. In multi-class classification scenarios, clinical samples are often limited due to cost constraints, making it necessary to determine the sample size needed to build an accurate classifier based on SNPs. The performance of such classifiers can be assessed using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) for two classes and the Volume Under the ROC hyper-Surface (VUS) for three or more classes. Sample size determination based on AUC or VUS would not only guarantee an overall correct classification rate, but also make studies more cost-effective.
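To make the two performance measures concrete, here is a minimal sketch of their empirical versions. It assumes each sample gets a single scalar score and that the three classes have a natural ordering. This is one common definition of the VUS and may differ from the one used in the paper. The function names `empirical_auc` and `empirical_vus` are hypothetical.

```python
from itertools import product


def empirical_auc(scores_a, scores_b):
    """Fraction of (a, b) pairs ranked correctly (a < b); ties count one half."""
    total = 0.0
    for a, b in product(scores_a, scores_b):
        if a < b:
            total += 1.0
        elif a == b:
            total += 0.5
    return total / (len(scores_a) * len(scores_b))


def empirical_vus(scores_1, scores_2, scores_3):
    """Fraction of triples, one score per class, that are strictly ordered.

    Assumption: three ordered classes and one scalar score per sample.
    Tied scores are counted as incorrect to keep the sketch short.
    """
    count = sum(
        1 for a, b, c in product(scores_1, scores_2, scores_3) if a < b < c
    )
    return count / (len(scores_1) * len(scores_2) * len(scores_3))
```

Under this definition a score that carries no information gives an AUC near 1/2 and a three-class VUS near 1/6, so the two measures are not on the same scale.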
For coded SNP data from D (≥ 2) classes, we derive an optimal Bayes classifier and a linear classifier, and obtain a normal approximation to the probability of correct classification for each classifier. These approximations are then used to evaluate the associated AUCs or VUSs, and the accuracy of the approximations is validated using Monte Carlo simulations. We give a sample size determination method which ensures that the difference between the approximate AUCs (or VUSs) of the two classifiers is below a pre-specified threshold. The performance of this method is then illustrated via simulations. For the HapMap data with three and four populations, a linear classifier is built using 92 independent SNPs, and the required total sample sizes are determined for a continuum of threshold values. In all, four different sample size determination studies are conducted with the HapMap data, covering cases that range from well-separated populations to poorly separated ones.
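The idea of choosing a sample size from the gap between two AUCs can be illustrated with a rough two-class simulation. This sketch is not the paper's method and not the API of the "SampleSizeSNP" package: it replaces the normal approximation with brute-force Monte Carlo, and it assumes that genotypes are coded 0/1/2 and drawn independently as Binomial(2, allele frequency). All function names and default values are hypothetical.

```python
import math
import random


def draw_genotypes(freqs, n, rng):
    """n individuals; each SNP coded 0/1/2 as Binomial(2, allele frequency)."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n)]


def log_odds_weights(f0, f1):
    """Per-SNP weights of the log-likelihood ratio, class 1 versus class 0."""
    return [math.log(q * (1 - p) / (p * (1 - q))) for p, q in zip(f0, f1)]


def estimate_freqs(genotypes):
    """Allele frequencies with a 0.5 pseudo-count so no estimate is 0 or 1."""
    n = len(genotypes)
    return [(sum(col) + 0.5) / (2 * n + 1) for col in zip(*genotypes)]


def auc(weights, test0, test1):
    """Empirical AUC of the linear score sum(w * x); ties count one half."""
    s0 = [sum(w * x for w, x in zip(weights, g)) for g in test0]
    s1 = [sum(w * x for w, x in zip(weights, g)) for g in test1]
    wins = sum((a < b) + 0.5 * (a == b) for a in s0 for b in s1)
    return wins / (len(s0) * len(s1))


def required_sample_size(f0, f1, grid, threshold, n_test=200, n_rep=20, seed=1):
    """Smallest per-class n in grid whose mean AUC gap is below threshold.

    The gap is the AUC of the classifier built from the true frequencies
    minus the AUC of the same classifier built from n training samples per
    class. Returns (n, oracle_auc); n is None if no grid value qualifies.
    """
    rng = random.Random(seed)
    test0 = draw_genotypes(f0, n_test, rng)
    test1 = draw_genotypes(f1, n_test, rng)
    auc_oracle = auc(log_odds_weights(f0, f1), test0, test1)
    for n in grid:
        gaps = []
        for _ in range(n_rep):
            est0 = estimate_freqs(draw_genotypes(f0, n, rng))
            est1 = estimate_freqs(draw_genotypes(f1, n, rng))
            plug_in = auc(log_odds_weights(est0, est1), test0, test1)
            gaps.append(auc_oracle - plug_in)
        if sum(gaps) / n_rep < threshold:
            return n, auc_oracle
    return None, auc_oracle
```

A call such as `required_sample_size([0.2] * 10, [0.5] * 10, grid=[10, 20, 40, 80], threshold=0.01)` walks up the grid and stops at the first training size whose average gap falls under the threshold. The allele frequencies and grid here are made-up inputs, and the result depends on the random seed.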
For multi-class problems, we have developed a sample size determination methodology and illustrated its usefulness in obtaining a required sample size from the estimated learning curve. In classification studies, this methodology will help scientists determine whether the sample at hand is adequate or whether more samples are required to achieve a pre-specified accuracy. A PDF manual for the R package "SampleSizeSNP" is given in Additional file 1, and a ZIP file of the package is given in Additional file 2.