Liu Tengfei, Hu Yongli, Wang Boyue, Sun Yanfeng, Gao Junbin, Yin Baocai
IEEE Trans Neural Netw Learn Syst. 2023 Oct;34(10):8071-8085. doi: 10.1109/TNNLS.2022.3185295. Epub 2023 Oct 5.
Long document classification (LDC) has attracted growing attention in natural language processing (NLP) as the number of publications increases exponentially. Many LDC methods built on pretrained language models have been proposed and have made considerable progress. However, most existing methods model a long document as a plain sequence of text and ignore its structure, which limits their ability to represent long texts that carry structural information. To mitigate this limitation, we propose a novel hierarchical graph convolutional network (HGCN) for structured LDC in this article. A section graph network models the macrostructure of a document, and a word graph network with a decoupled graph convolutional block extracts its fine-grained features. In addition, an interaction strategy integrates the two networks by propagating features between them. To verify the effectiveness of the proposed model, we construct four structured long document datasets; extensive experiments on these datasets and on one additional unstructured dataset show that the proposed method outperforms state-of-the-art related classification methods.
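The two-level design described in the abstract (a word graph, a section graph, and feature propagation between them) can be illustrated with a minimal sketch. This is not the authors' HGCN: the abstract does not specify the decoupled graph convolutional block or the exact interaction strategy, so the code below substitutes a standard symmetric-normalized graph convolution and simple mean pooling and broadcasting as stand-ins. All function names, the toy graphs, and the identity weight matrices are hypothetical and chosen only for illustration.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj, feats, weight):
    """One standard graph-convolution step: ReLU(A_norm @ X @ W).
    Assumption: a generic GCN layer, not the paper's decoupled block."""
    h = matmul(matmul(normalize_adjacency(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in h]

def pool_words_to_sections(word_feats, word_section, n_sections):
    """Bottom-up step (assumed): mean-pool word features per section."""
    dim = len(word_feats[0])
    sums = [[0.0] * dim for _ in range(n_sections)]
    counts = [0] * n_sections
    for feat, sec in zip(word_feats, word_section):
        counts[sec] += 1
        for k in range(dim):
            sums[sec][k] += feat[k]
    return [[v / max(c, 1) for v in row] for row, c in zip(sums, counts)]

def broadcast_sections_to_words(word_feats, section_feats, word_section):
    """Top-down step (assumed): add each section's feature to its words."""
    return [[w + s for w, s in zip(feat, section_feats[sec])]
            for feat, sec in zip(word_feats, word_section)]

def hierarchical_forward(word_adj, section_adj, word_feats, word_section,
                         w_word, w_sec):
    """Word-level convolution, pooling to sections, section-level
    convolution, then propagation of section features back to words."""
    n_sections = len(section_adj)
    h_word = gcn_layer(word_adj, word_feats, w_word)
    h_sec = pool_words_to_sections(h_word, word_section, n_sections)
    h_sec = gcn_layer(section_adj, h_sec, w_sec)
    h_word = broadcast_sections_to_words(h_word, h_sec, word_section)
    dim = len(h_sec[0])
    # Document vector: mean over section features (an assumed readout).
    doc = [sum(row[k] for row in h_sec) / n_sections for k in range(dim)]
    return h_word, h_sec, doc

# Toy document: 4 words in 2 sections; words 0-1 and 2-3 are linked,
# and the two sections are linked to each other.
word_adj = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
section_adj = [[0, 1], [1, 0]]
word_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
word_section = [0, 0, 1, 1]
identity = [[1.0, 0.0], [0.0, 1.0]]

h_word, h_sec, doc = hierarchical_forward(
    word_adj, section_adj, word_feats, word_section, identity, identity)
print(doc)
```

In a real classifier the weight matrices would be learned and the document vector passed to a classification layer; the sketch only shows how features can flow from words to sections and back within one forward pass.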