IEEE Trans Neural Netw Learn Syst. 2019 Dec;30(12):3774-3787. doi: 10.1109/TNNLS.2019.2899045. Epub 2019 Mar 15.
Crowdsourcing has become one of the most appealing ways to obtain a plethora of labels at low cost. Nevertheless, labels from amateur workers are often noisy, which inevitably degrades the robustness of subsequent learning models. To improve label quality for subsequent use, majority voting (MV) is widely leveraged to aggregate crowdsourced labels due to its simplicity and scalability. However, when crowdsourced labels are "heavily" noisy (e.g., 40% noisy labels), MV may not work well: "garbage (heavily noisy labels) in, garbage (full aggregated labels) out." This issue inspires us to ask: if the ultimate target is to learn a robust model from noisy labels, why not provide only partial aggregated labels and ensure that these labels are reliable enough for learning models? To address this challenge by improving MV, we propose a coarse-to-fine label filtration model called the double filter machine (DFM), which consists of a (majority) voting filter and a sparse filter connected in series. Specifically, the DFM refines crowdsourced labels from coarse filtering to fine filtering. In the coarse-filtering stage, the DFM aggregates crowdsourced labels with the voting filter, which yields (quality-acceptable) full aggregated labels. In the fine-filtering stage, the DFM further extracts a set of high-quality labels from the full aggregated labels with the sparse filter, which identifies high-quality labels via support selection. Based on insights from compressed sensing, the DFM recovers a ground-truth signal from heavily noisy data under a restricted isometry property. In sum, the primary benefits of the DFM are that it retains scalability through the voting filter while improving robustness through the sparse filter. We also derive theoretical guarantees for the convergence and recovery of the DFM and analyze its complexity. We conduct comprehensive experiments on both UCI simulated and AMT crowdsourced datasets.
Empirical results show that partial aggregated labels provided by DFM effectively improve the robustness of learning models.
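The coarse-to-fine idea can be illustrated with a minimal sketch: a voting filter aggregates worker labels per item, and a fine filter then keeps only the items whose aggregated labels look most reliable. This is not the paper's DFM; the fine stage below uses a simple vote-margin heuristic as a hedged stand-in for the sparse filter's support selection, and all function names (`majority_vote`, `margin_filter`) and the `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def majority_vote(labels):
    """Coarse filter: aggregate each item's crowdsourced labels by majority vote.

    labels: (n_items, n_workers) array of binary labels in {-1, +1}
            (a missing answer could be encoded as 0 and drop out of the sum).
    Returns full aggregated labels, one per item.
    """
    votes = labels.sum(axis=1)
    return np.where(votes >= 0, 1, -1)

def margin_filter(labels, keep_ratio=0.5):
    """Fine filter (illustrative stand-in for DFM's sparse filter):
    keep only the items with the largest vote margin, i.e. where the
    crowd agrees most strongly, yielding partial aggregated labels.
    """
    agg = majority_vote(labels)
    margin = np.abs(labels.sum(axis=1))      # strength of agreement per item
    n_keep = max(1, int(keep_ratio * len(agg)))
    support = np.argsort(-margin)[:n_keep]   # indices of the most reliable items
    return support, agg[support]
```

A downstream model would then be trained only on `(support, partial labels)`, trading label coverage for label reliability, which is exactly the motivation the abstract gives for partial aggregated labels.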