HiGen: Hierarchy-Aware Sequence Generation for Hierarchical Text Classification

Authors

Jain Vidit, Rungta Mukund, Zhuang Yuchen, Yu Yue, Wang Zeyu, Gao Mu, Skolnick Jeffrey, Zhang Chao

Affiliations

Georgia Institute of Technology.

Microsoft, Cambridge, USA.

Publication

Proc Conf Assoc Comput Linguist Meet. 2024 Mar;2024(EACL):1354-1368.

PMID: 39886530

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11781299/

Abstract

Hierarchical text classification (HTC) is a complex subtask under multi-label text classification, characterized by a hierarchical label taxonomy and data imbalance. The best-performing models aim to learn a static representation by combining document and hierarchical label information. However, the relevance of document sections can vary based on the hierarchy level, necessitating a dynamic document representation. To address this, we propose HiGen, a text-generation-based framework utilizing language models to encode dynamic text representations. We introduce a level-guided loss function to capture the relationship between text and label name semantics. Our approach incorporates a task-specific pretraining strategy, adapting the language model to in-domain knowledge and significantly enhancing performance for classes with limited examples. Furthermore, we present a new and valuable dataset called ENZYME, designed for HTC, which comprises articles from PubMed with the goal of predicting Enzyme Commission (EC) numbers. Through extensive experiments on the ENZYME dataset and the widely recognized WOS and NYT datasets, our methodology demonstrates superior performance, surpassing existing approaches while efficiently handling data and mitigating class imbalance. We release our code and dataset here: https://github.com/viditjain99/HiGen.
