Tao Dapeng, Cheng Jun, Yu Zhengtao, Yue Kun, Wang Lizhen
IEEE Trans Neural Netw Learn Syst. 2019 Jan;30(1):163-174. doi: 10.1109/TNNLS.2018.2836969. Epub 2018 Jun 5.
Crowdsourcing labeling systems provide an efficient way to generate multiple inaccurate labels for given observations. If the competence level or the "reputation," which can be explained as the probabilities of annotating the right label, for each crowdsourcing annotators is equal and biased to annotate the right label, majority voting (MV) is the optimal decision rule for merging the multiple labels into a single reliable one. However, in practice, the competence levels of annotators employed by the crowdsourcing labeling systems are often diverse very much. In these cases, weighted MV is more preferred. The weights should be determined by the competence levels. However, since the annotators are anonymous and the ground-truth labels are usually unknown, it is hard to compute the competence levels of the annotators directly. In this paper, we propose to learn the weights for weighted MV by exploiting the expertise of annotators. Specifically, we model the domain knowledge of different annotators with different distributions and treat the crowdsourcing problem as a domain adaptation problem. The annotators provide labels to the source domains and the target domain is assumed to be associated with the ground-truth labels. The weights are obtained by matching the source domains with the target domain. Although the target-domain labels are unknown, we prove that they could be estimated under mild conditions. Both theoretical and empirical analyses verify the effectiveness of the proposed method. Large performance gains are shown for specific data sets.
众包标注系统提供了一种为给定观测生成多个不准确标签的有效方法。如果每个众包标注者的能力水平或“声誉”(可解释为标注正确标签的概率)相等且偏向于标注正确标签,多数投票(MV)是将多个标签合并为单个可靠标签的最优决策规则。然而,在实际中,众包标注系统所雇佣标注者的能力水平往往差异很大。在这些情况下,加权多数投票更受青睐。权重应由能力水平决定。然而,由于标注者是匿名的且真实标签通常未知,很难直接计算标注者的能力水平。在本文中,我们提议通过利用标注者的专业知识来学习加权多数投票的权重。具体而言,我们用不同的分布对不同标注者的领域知识进行建模,并将众包问题视为一个领域适应问题。标注者为源域提供标签,且假设目标域与真实标签相关联。通过使源域与目标域匹配来获得权重。尽管目标域标签未知,但我们证明在温和条件下它们是可以估计的。理论和实证分析都验证了所提方法的有效性。针对特定数据集展示了显著的性能提升。