Neyman-Pearson Multi-class Classification via Cost-sensitive Learning.

Author information

Tian Ye, Feng Yang

Affiliations

Department of Statistics, Columbia University.

Department of Biostatistics, School of Global Public Health, New York University.

Publication information

J Am Stat Assoc. 2025;120(550):1164-1177. doi: 10.1080/01621459.2024.2402567. Epub 2024 Nov 19.

DOI:10.1080/01621459.2024.2402567

PMID:40689012

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12268361/

Abstract

Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications such as loan default prediction, different types of errors can have varying consequences. To address this asymmetry issue, two popular paradigms have been developed: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Previous studies on the NP paradigm have primarily focused on the binary case, while the multi-class NP problem poses a greater challenge due to its unknown feasibility. In this work, we tackle the multi-class NP problem by establishing a connection with the CS problem via strong duality and propose two algorithms. We extend the concept of NP oracle inequalities, crucial in binary classifications, to NP oracle properties in the multi-class context. Our algorithms satisfy these NP oracle properties under certain conditions. Furthermore, we develop practical algorithms to assess the feasibility and strong duality in multi-class NP problems, which can offer practitioners the landscape of a multi-class NP problem with various target error levels. Simulations and real data studies validate the effectiveness of our algorithms. To our knowledge, this is the first study to address the multi-class NP problem with theoretical guarantees. The proposed algorithms have been implemented in the R package npcs, which is available on CRAN.
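
The link between the NP and CS problems described in the abstract can be sketched as a worked formulation. The notation here is an assumption made for illustration and does not come from the abstract: $K$ classes, a classifier $\phi$, per-class error rates $R_k(\phi)$, objective weights $w_k \ge 0$, target levels $\alpha_k$ for a constrained subset of classes $\mathcal{A}$, class priors $\pi_k$, and multipliers $\lambda_k$. The paper's own statement may differ in detail.

```latex
% Multi-class NP problem (notation assumed for illustration)
\[
\min_{\phi}\ \sum_{k=1}^{K} w_k R_k(\phi)
\quad \text{s.t.} \quad R_k(\phi) \le \alpha_k,\ k \in \mathcal{A},
\qquad R_k(\phi) = P\bigl(\phi(X) \ne k \mid Y = k\bigr).
\]

% Lagrangian with multipliers \lambda_k \ge 0 (set \lambda_k = 0 for k \notin \mathcal{A})
\[
F_{\lambda}(\phi)
= \sum_{k=1}^{K} w_k R_k(\phi)
+ \sum_{k \in \mathcal{A}} \lambda_k \bigl(R_k(\phi) - \alpha_k\bigr)
= \sum_{k=1}^{K} (w_k + \lambda_k)\, R_k(\phi)
- \sum_{k \in \mathcal{A}} \lambda_k \alpha_k .
\]

% For fixed \lambda, minimizing F_\lambda over \phi is a CS problem
% with per-class costs w_k + \lambda_k, solved pointwise by
\[
\phi_{\lambda}(x)
= \arg\max_{k}\ \frac{w_k + \lambda_k}{\pi_k}\, P(Y = k \mid X = x).
\]

% Dual problem; under strong duality its optimal value equals the NP optimum
\[
\max_{\lambda \ge 0}\ \min_{\phi}\ F_{\lambda}(\phi).
\]
```

Read this way, each choice of $\lambda$ yields a reweighted plug-in classifier. When the NP problem is feasible and strong duality holds, searching over $\lambda \ge 0$ recovers a classifier that meets the target error levels.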
