Marx Alexander, Backes Christina, Meese Eckart, Lenhof Hans-Peter, Keller Andreas
Chair for Clinical Bioinformatics, Medical Faculty, Saarland University, Saarbrücken 66123, Germany.
Department of Human Genetics, Saarland University, University Hospital, Homburg 66421, Germany.
Genomics Proteomics Bioinformatics. 2016 Feb;14(1):55-61. doi: 10.1016/j.gpb.2015.11.004. Epub 2016 Jan 29.
In many research disciplines, hypothesis tests are applied to evaluate whether findings are statistically significant or could be explained by chance. The Wilcoxon-Mann-Whitney (WMW) test is among the most popular hypothesis tests in medicine and the life sciences for assessing whether two groups of samples come from the same distribution. This nonparametric statistical homogeneity test is commonly applied in molecular diagnosis. Computing the exact solution of the WMW test generally requires high combinatorial effort for large sample cohorts containing a substantial number of ties. Hence, the P value is frequently approximated using a normal distribution. We developed EDISON-WMW, a new approach to calculate the exact permutation P value of the two-tailed unpaired WMW test, allowing for ties and requiring no corrections. The method relies on dynamic programming to solve the combinatorial problem of the WMW test efficiently. Beyond a straightforward implementation of the algorithm, we present different optimization strategies and a parallel solution. Using our program, the exact P value for large cohorts containing more than 1000 samples with ties can be calculated within minutes. We demonstrate the performance of this novel approach on randomly generated data, benchmark it against 13 other commonly applied approaches, and evaluate molecular biomarkers for lung carcinoma and chronic obstructive pulmonary disease (COPD). We found that approximated P values were generally higher than the exact solution provided by EDISON-WMW. Importantly, the algorithm can also be applied to high-throughput omics datasets, where hundreds or thousands of features are included. To provide easy access to the multi-threaded version of EDISON-WMW, a web-based solution of our algorithm is freely available at http://www.ccb.uni-saarland.de/software/wtest/.
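To illustrate the general idea of computing an exact two-tailed WMW P value with ties by dynamic programming, the following is a minimal Python sketch. It is not the EDISON-WMW implementation and includes none of its optimizations or parallelization. The function names `doubled_midranks` and `exact_wmw_two_sided` are hypothetical. The sketch assumes the usual midrank treatment of ties and doubles every midrank so that all rank sums are integers. It then counts, for each subset size, how many subsets of the pooled sample reach each rank sum.

```python
from collections import Counter
from fractions import Fraction
from math import comb


def doubled_midranks(x, y):
    """Return doubled midranks (always integers) of x and y in the pooled sample."""
    pooled = sorted(list(x) + list(y))
    count = Counter(pooled)
    first = {}  # value -> 1-based position of its first occurrence
    for pos, value in enumerate(pooled, start=1):
        first.setdefault(value, pos)
    # midrank = first + (count - 1) / 2, hence doubled = 2 * first + count - 1
    rank2 = {v: 2 * first[v] + count[v] - 1 for v in count}
    return [rank2[v] for v in x], [rank2[v] for v in y]


def exact_wmw_two_sided(x, y):
    """Exact two-sided permutation P value of the rank-sum statistic (illustrative)."""
    rx, ry = doubled_midranks(x, y)
    n1, n = len(rx), len(rx) + len(ry)
    observed = sum(rx)
    # dp[k][s] = number of size-k subsets of the pooled ranks with doubled sum s
    dp = [Counter() for _ in range(n1 + 1)]
    dp[0][0] = 1
    for r in rx + ry:
        # Walk k downwards so that each rank is used at most once per subset.
        for k in range(n1, 0, -1):
            for s, ways in dp[k - 1].items():
                dp[k][s + r] += ways
    mean2 = n1 * (n + 1)  # doubled expected rank sum under the null hypothesis
    deviation = abs(observed - mean2)
    extreme = sum(w for s, w in dp[n1].items() if abs(s - mean2) >= deviation)
    return Fraction(extreme, comb(n, n1))


print(exact_wmw_two_sided([1, 2, 3], [4, 5, 6]))  # 1/10
print(exact_wmw_two_sided([1, 2], [2, 3]))        # 2/3
```

Returning a `Fraction` keeps the result exact. The table of reachable rank sums grows quickly with cohort size, which is why a practical tool for cohorts of more than 1000 samples needs the kind of optimization and parallelization the abstract describes.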