Wu Jin Chu, Martin Alvin F, Kacker Raghu N
National Institute of Standards and Technology, Gaithersburg, Maryland 20899, USA.
Commun Stat Simul Comput. 2018;48(2). doi: 10.1080/03610918.2018.1521974.
ROC analysis involving two large datasets is an important method for analyzing statistics of interest for classifier decision making in many disciplines. Data dependency is ubiquitous, because limited resources often mean that the same subjects are used multiple times to generate more samples. To account for this dependency, a two-layer data structure is constructed, and the nonparametric two-sample two-layer bootstrap is employed to estimate the standard errors of statistics of interest derived from two sets of data, such as a weighted sum of two probabilities. In this article, to reduce the bootstrap variance and ensure the accuracy of computation, Monte Carlo studies of bootstrap variability were carried out to determine the appropriate number of bootstrap replications in ROC analysis with data dependency. The results suggest that, with a tolerance of 0.02 on the coefficient of variation, 2,000 bootstrap replications are appropriate under such circumstances.
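The procedure the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the first layer is the subject and the second layer is that subject's repeated scores, and it takes the "weighted sum of two probabilities" to be a weighted miss rate plus false-alarm rate at a fixed threshold. All function names, the synthetic data generator, the weights, and the small replication counts are hypothetical choices made for illustration.

```python
import random
import statistics


def resample_two_layer(groups, rng):
    """Layer 1: resample subjects with replacement.
    Layer 2: resample scores with replacement within each chosen subject."""
    chosen = [rng.choice(groups) for _ in groups]
    return [[rng.choice(scores) for _ in scores] for scores in chosen]


def weighted_error(pos_groups, neg_groups, threshold, w_miss=0.5, w_false=0.5):
    """Assumed statistic: weighted sum of miss rate and false-alarm rate."""
    pos = [s for g in pos_groups for s in g]
    neg = [s for g in neg_groups for s in g]
    p_miss = sum(s < threshold for s in pos) / len(pos)
    p_false = sum(s >= threshold for s in neg) / len(neg)
    return w_miss * p_miss + w_false * p_false


def bootstrap_se(pos_groups, neg_groups, threshold, n_boot, rng):
    """Bootstrap standard error from n_boot two-sample two-layer replications."""
    stats = [
        weighted_error(
            resample_two_layer(pos_groups, rng),
            resample_two_layer(neg_groups, rng),
            threshold,
        )
        for _ in range(n_boot)
    ]
    return statistics.stdev(stats)


def cv_of_se(pos_groups, neg_groups, threshold, n_boot, n_runs, rng):
    """Monte Carlo check: coefficient of variation of the SE over repeated runs."""
    ses = [
        bootstrap_se(pos_groups, neg_groups, threshold, n_boot, rng)
        for _ in range(n_runs)
    ]
    return statistics.stdev(ses) / statistics.mean(ses)


def make_groups(n_subjects, n_per_subject, mean, rng):
    """Hypothetical synthetic data: a shared per-subject offset induces dependency."""
    groups = []
    for _ in range(n_subjects):
        offset = rng.gauss(0.0, 0.5)
        groups.append(
            [mean + offset + rng.gauss(0.0, 1.0) for _ in range(n_per_subject)]
        )
    return groups


if __name__ == "__main__":
    rng = random.Random(0)
    pos = make_groups(30, 4, 1.5, rng)
    neg = make_groups(30, 4, 0.0, rng)
    # Small replication counts keep the demo fast; the abstract's setting is far larger.
    for n_boot in (50, 200):
        cv = cv_of_se(pos, neg, threshold=0.75, n_boot=n_boot, n_runs=10, rng=rng)
        print(f"B={n_boot}: CV of SE estimate = {cv:.3f}")
```

In a study of this kind, one would increase `n_boot` until the coefficient of variation of the standard error estimate falls below the chosen tolerance (0.02 in the abstract). The counts used here are far too small to reproduce that result and only show the mechanics.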