Rios Anthony, Kavuluru Ramakanth
Department of Computer Science, University of Kentucky, Lexington, KY.
Division of Biomedical Informatics, University of Kentucky, Lexington, KY.
Proc Conf Empir Methods Nat Lang Process. 2018 Oct-Nov;2018:3132-3142.
Large multi-label datasets contain labels that occur thousands of times (frequent group), those that occur only a few times (few-shot group), and labels that never appear in the training dataset (zero-shot group). Multi-label few- and zero-shot label prediction is mostly unexplored on datasets with large label spaces, especially for text classification. In this paper, we perform a fine-grained evaluation to understand how state-of-the-art methods perform on infrequent labels. Furthermore, we develop few- and zero-shot methods for multi-label text classification when there is a known structure over the label space, and evaluate them on two publicly available medical text datasets: MIMIC II and MIMIC III. For few-shot labels we achieve improvements of 6.2% and 4.8% in R@10 for MIMIC II and MIMIC III, respectively, over prior efforts; the corresponding R@10 improvements for zero-shot labels are 17.3% and 19%.
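As a rough illustration of the evaluation described in the abstract, the sketch below partitions labels into frequent, few-shot, and zero-shot groups by their training-set counts and computes R@k within one group. It rests on assumptions the abstract does not state: the cut-off `few_shot_max=5`, the choice to rank only the labels inside the group being evaluated, and the helper names `group_labels` and `recall_at_k` are all hypothetical.

```python
from collections import Counter


def group_labels(train_label_sets, all_labels, few_shot_max=5):
    """Split labels into groups by training frequency.

    few_shot_max is an assumed threshold, not a value taken from the abstract.
    """
    counts = Counter(label for labels in train_label_sets for label in labels)
    groups = {"frequent": set(), "few_shot": set(), "zero_shot": set()}
    for label in all_labels:
        n = counts[label]
        if n == 0:
            groups["zero_shot"].add(label)   # never seen in training
        elif n <= few_shot_max:
            groups["few_shot"].add(label)    # seen only a few times
        else:
            groups["frequent"].add(label)
    return groups


def recall_at_k(scores, gold, candidates, k=10):
    """Mean recall@k over documents, ranking only the labels in `candidates`.

    scores: one dict per document mapping label -> predicted score
    gold:   one set per document holding its true labels
    Documents with no true label inside `candidates` are skipped.
    """
    recalls = []
    for doc_scores, doc_gold in zip(scores, gold):
        relevant = doc_gold & candidates
        if not relevant:
            continue
        # Sort by descending score; break ties by label for determinism.
        ranked = sorted(
            candidates,
            key=lambda l: (-doc_scores.get(l, float("-inf")), l),
        )
        top_k = set(ranked[:k])
        recalls.append(len(relevant & top_k) / len(relevant))
    return sum(recalls) / len(recalls) if recalls else 0.0
```

Calling `recall_at_k(scores, gold, groups["zero_shot"], k=10)` would then give a zero-shot R@10 under these assumptions; the paper's exact protocol may differ.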