Stacking Model-Based Classifiers for Dealing With Multiple Sets of Noisy Labels.

Authors

Montani Giulia, Cappozzo Andrea

Affiliations

Data Reply srl, Turin, Italy.

Department of Statistical Sciences, Università Cattolica del Sacro Cuore, Milan, Italy.

Publication information

Biom J. 2025 Apr;67(2):e70042. doi: 10.1002/bimj.70042.

Abstract

Supervised learning in the presence of multiple sets of noisy labels is a challenging task that is receiving increasing interest in healthcare analytics. The issue arises when multiple annotators manually label the same training samples, so that the supplied labels may disagree with one another and with the ground truth. Commonly, the labeling process is entrusted to a small group of domain experts, and differing levels of experience and subjectivity may result in noisy training labels. To solve the classification task while leveraging the availability of multiple annotators, we introduce a novel ensemble methodology constructed by combining model-based classifiers, each trained separately on a single set of noisy labels. Eigenvalue Decomposition Discriminant Analysis is employed to define the base learners, and six distinct averaging strategies are proposed to combine them. Two of these require a priori information, such as partial knowledge of the ground truth labels or the annotators' level of expertise. In contrast, the remaining four approaches are entirely data-driven. A simulation study and an application to real data showcase the improved predictive performance of our proposal, while also demonstrating its ability to automatically infer the annotators' level of expertise as a by-product of the learning process.
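The general idea described in the abstract (one model-based classifier per annotator, combined by weighted averaging) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the base learner is a plain diagonal-covariance Gaussian classifier standing in for Eigenvalue Decomposition Discriminant Analysis, the data are simulated, and the weighting rule (each annotator's agreement with the majority vote) is a hypothetical data-driven choice that is not claimed to be one of the six strategies proposed in the paper. All function names are invented for this example.

```python
import math
import random
from collections import Counter

def fit_gaussian(X, y):
    """Fit a diagonal-covariance Gaussian classifier (stand-in base learner)."""
    n, d = len(y), len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        # Small constant keeps variances strictly positive.
        var = [sum((r[j] - mean[j]) ** 2 for r in rows) / len(rows) + 1e-6
               for j in range(d)]
        model[c] = (len(rows) / n, mean, var)
    return model

def posterior(model, x):
    """Posterior class probabilities for one observation."""
    log_p = {}
    for c, (prior, mean, var) in model.items():
        log_p[c] = math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
            for xj, m, v in zip(x, mean, var))
    top = max(log_p.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(lp - top) for c, lp in log_p.items()}
    total = sum(unnorm.values())
    return {c: u / total for c, u in unnorm.items()}

def agreement_weights(label_sets):
    """Hypothetical weighting: agreement with the majority vote, normalised."""
    majority = [Counter(votes).most_common(1)[0][0] for votes in zip(*label_sets)]
    agree = [sum(a == m for a, m in zip(labels, majority)) / len(majority)
             for labels in label_sets]
    total = sum(agree)
    return [a / total for a in agree]

def ensemble_predict(models, weights, x):
    """Weighted average of the base learners' posteriors, then argmax."""
    posts = [posterior(m, x) for m in models]
    classes = posts[0].keys()  # assumes every annotator used every class
    avg = {c: sum(w * p.get(c, 0.0) for w, p in zip(weights, posts))
           for c in classes}
    return max(avg, key=avg.get)

def simulate(n_per_class, rng):
    """Two well-separated 2-D Gaussian classes (simulated, not real data)."""
    X, y = [], []
    for c, centre in enumerate([0.0, 3.0]):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            y.append(c)
    return X, y

rng = random.Random(0)
X_train, y_train = simulate(100, rng)
X_test, y_test = simulate(100, rng)

# Three simulated annotators with different label-flip probabilities.
flip_rates = [0.05, 0.10, 0.45]
label_sets = [[1 - t if rng.random() < p else t for t in y_train]
              for p in flip_rates]

models = [fit_gaussian(X_train, labels) for labels in label_sets]
weights = agreement_weights(label_sets)
predictions = [ensemble_predict(models, weights, x) for x in X_test]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print("annotator weights:", weights)
print("ensemble test accuracy:", accuracy)
```

In this toy setting the weights play the role of an inferred expertise score: the annotator with the highest flip probability agrees least with the majority and is therefore down-weighted in the averaged posterior.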

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6763/11898607/36be5f694640/BIMJ-67-e70042-g009.jpg
