Itaya Yuki, Tamura Jun, Hayashi Kenichi, Yamamoto Kouji
Graduate School of Science and Technology, Keio University, Yokohama, Japan.
Graduate School of Medicine, Yokohama City University, Yokohama, Japan.
Stat Med. 2025 Jan 15;44(1-2):e10303. doi: 10.1002/sim.10303. Epub 2024 Dec 16.
Evaluating classifications is crucial in statistics and machine learning, as it influences decision-making across various fields, such as patient prognosis and therapy in critical conditions. The Matthews correlation coefficient (MCC), also known as the phi coefficient, is recognized as a performance metric with high reliability, offering a balanced measurement even in the presence of class imbalances. Despite its importance, there remains a notable lack of comprehensive research on the statistical inference of MCC. This deficiency often leads to studies merely validating and comparing MCC point estimates, a practice that, while common, overlooks the statistical significance and reliability of results. Addressing this research gap, our paper introduces and evaluates several methods to construct asymptotic confidence intervals for a single MCC and for the difference between MCCs in paired designs. Through simulations across various scenarios, we evaluate the finite-sample behavior of these methods and compare their performance. Furthermore, through real data analysis, we illustrate the potential utility of our findings in comparing binary classifiers, highlighting the possible contributions of our research in this field.
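As a point of reference for the quantity discussed above, the following is a minimal sketch of the MCC point estimate computed from a 2x2 confusion matrix, using the standard textbook definition. It does not implement any of the paper's confidence interval methods, which the abstract does not specify. The function names `mcc` and `accuracy` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient (phi) from the four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumption: return 0 when a row or column total is zero,
        # a common convention for the otherwise undefined case.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, fp, fn, tn):
    """Proportion of correct predictions."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical imbalanced sample: 5 positives and 95 negatives.
# The classifier detects only 1 of the 5 positives.
cells = dict(tp=1, fp=1, fn=4, tn=94)

print(f"accuracy = {accuracy(**cells):.3f}")
print(f"MCC      = {mcc(**cells):.3f}")
```

In this made-up sample, accuracy looks high because the majority class dominates, while the MCC stays well below 1 because most positives are missed. This is the sense in which MCC is described as balanced under class imbalance. A point estimate like this carries no measure of uncertainty on its own, which is the gap the interval methods in the paper address.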