Department of Biostatistics, Hyogo Medical University, Hyogo, Japan.
Department of Biostatistics, School of Medicine, Yokohama City University, Kanagawa, Japan.
Stat Med. 2023 Oct 15;42(23):4177-4192. doi: 10.1002/sim.9853. Epub 2023 Aug 1.
In modern medicine, medical tests are used for various purposes, including diagnosis, disease screening, prognosis, and risk prediction. To quantify the performance of a binary medical test, we often use sensitivity, specificity, and the negative and positive predictive values. In addition, the F1-score, defined as the harmonic mean of precision (positive predictive value) and recall (sensitivity), has come into use in the medical field because of its favorable characteristics. The F1-score has been extended to multi-class classification, for which two types have been proposed: the micro-averaged F1-score and the macro-averaged F1-score. The micro-averaged F1-score pools per-sample classifications across classes and then calculates the overall F1-score, whereas the macro-averaged F1-score is the arithmetic mean of the per-class F1-scores. Sokolova and Lapalme gave an alternative definition of the macro-averaged F1-score as the harmonic mean of the arithmetic means of precision and recall over classes. Although some statistical inference methods for binary and multi-class F1-scores have been proposed, hypothesis testing procedures for them have not yet been fully developed. We therefore aim to develop a hypothesis testing procedure for comparing two F1-scores in a paired study design, based on the large-sample multivariate central limit theorem.
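As an illustration of the three multi-class averages described above, the following minimal Python sketch computes the micro-averaged score, the macro-averaged score, and the alternative macro-averaged score of Sokolova and Lapalme from paired lists of true and predicted labels. It covers only the point estimates, not the hypothesis testing procedure. The function names (`per_class_counts`, `multiclass_f1`) are hypothetical, and setting a ratio with a zero denominator to 0 is an assumption made here for convenience, not a convention taken from the paper.

```python
from statistics import mean


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0.0 if both are zero; assumed convention)."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def per_class_counts(y_true, y_pred, labels):
    """Return {label: (TP, FP, FN)} using a one-vs-rest view of each class."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def multiclass_f1(y_true, y_pred):
    """Compute micro, macro, and alternative (Sokolova-Lapalme) macro F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, labels)

    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # assumed zero-division rule
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0     # assumed zero-division rule
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(harmonic_mean(precision, recall))

    # Micro: pool TP, FP, FN over classes, then take one overall F1.
    tp_sum = sum(tp for tp, _, _ in counts.values())
    fp_sum = sum(fp for _, fp, _ in counts.values())
    fn_sum = sum(fn for _, _, fn in counts.values())
    micro_p = tp_sum / (tp_sum + fp_sum) if (tp_sum + fp_sum) > 0 else 0.0
    micro_r = tp_sum / (tp_sum + fn_sum) if (tp_sum + fn_sum) > 0 else 0.0

    return {
        "micro": harmonic_mean(micro_p, micro_r),
        # Macro: arithmetic mean of the per-class F1-scores.
        "macro": mean(f1s),
        # Alternative macro: harmonic mean of the mean precision and mean recall.
        "macro_alt": harmonic_mean(mean(precisions), mean(recalls)),
    }


if __name__ == "__main__":
    # Toy data for illustration only (not from the paper).
    y_true = [0, 0, 0, 1, 1, 2]
    y_pred = [0, 0, 1, 1, 2, 2]
    print(multiclass_f1(y_true, y_pred))
```

On the toy data the two macro definitions give different values, which shows that they are distinct estimands rather than two formulas for the same quantity.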