Bangalore Sai Santosh, Wang Jelai, Allison David B
The University of Alabama at Birmingham, Section on Statistical Genetics, Department of Biostatistics, RPHB 327, 1665 University Boulevard, Birmingham, AL-35294-0022, USA.
Comput Stat Data Anal. 2009 May 15;53(7):2446-2452. doi: 10.1016/j.csda.2008.11.028.
In genomics and high dimensional biology (HDB), massive multiple testing prompts the use of extremely small significance levels. Because hypothesis testing relies on the tail areas of statistical distributions, these areas must be computed accurately for scientific judgments to be made with confidence. Previous work on accuracy focused primarily on evaluating professionally written statistical software, such as SAS, against the Statistical Reference Datasets (StRD) provided by the National Institute of Standards and Technology (NIST), and on the accuracy of tail areas in statistical distributions. The goal of this paper is to provide guidance to investigators who are developing their own custom scientific software built upon numerical libraries written by others. Specifically, we evaluate the accuracy of small tail areas from the cumulative distribution functions (CDF) of the Chi-square and t distributions by comparing several open-source, free, or commercially licensed numerical libraries in Java, C, and R against widely accepted standards of comparison such as ELV and DCDFLIB. In our evaluation, the C libraries and R functions are consistently accurate up to six significant digits. Among the evaluated Java libraries, Colt is the most accurate. These languages and libraries are popular choices among programmers developing scientific software, so the results herein can help programmers choose libraries on the basis of CDF accuracy.
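To make the concern about small tail areas concrete, the following is a minimal sketch and not the benchmark used in the paper. It assumes the Chi-square distribution with 2 degrees of freedom, chosen only because its upper tail has the closed form exp(-x/2), and it uses Python's standard library rather than any of the evaluated libraries or the ELV and DCDFLIB references. The function names and the `matching_digits` measure are hypothetical helpers introduced for illustration. The sketch compares a directly computed upper tail with the same quantity obtained as 1 - CDF, which loses significant digits as the tail shrinks.

```python
import math


def chi2_df2_upper_tail_direct(x):
    """Upper-tail area P(X > x) for Chi-square with 2 df, via the closed form exp(-x/2)."""
    return math.exp(-x / 2.0)


def chi2_df2_upper_tail_via_cdf(x):
    """Same tail area computed as 1 - CDF; the subtraction cancels for tiny tails."""
    cdf = -math.expm1(-x / 2.0)  # CDF = 1 - exp(-x/2), rounded to a double near 1
    return 1.0 - cdf


def matching_digits(value, reference):
    """Rough count of agreeing significant digits (hypothetical measure, capped at 16)."""
    if value == reference:
        return 16.0
    if value <= 0.0:
        return 0.0  # a zero or negative tail area carries no usable digits
    rel_err = abs(value - reference) / abs(reference)
    return max(0.0, min(16.0, -math.log10(rel_err)))


if __name__ == "__main__":
    # Larger x means a smaller tail area, as in stringent multiple-testing thresholds.
    for x in (20.0, 40.0, 60.0, 80.0):
        direct = chi2_df2_upper_tail_direct(x)
        via_cdf = chi2_df2_upper_tail_via_cdf(x)
        digits = matching_digits(via_cdf, direct)
        print(f"x={x:5.1f}  direct={direct:.6e}  1-CDF={via_cdf:.6e}  digits~{digits:.1f}")
```

In this sketch the 1 - CDF route returns exactly zero once the tail area falls below the spacing of doubles near 1, while the direct form keeps full precision. A library that exposes an upper-tail (survival) function directly is therefore likely to be preferable for very small significance levels, although that is a general numerical observation rather than a finding reported in the abstract.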