Virameteekul Paveen
Department of Computer Science & Engineering, Shanghai Jiao Tong University, Minhang, Shanghai, China.
PeerJ Comput Sci. 2022 Jun 10;8:e1003. doi: 10.7717/peerj-cs.1003. eCollection 2022.
Books are usually divided into chapters and sections. Correctly and automatically recognizing chapter boundaries can serve as a proxy for the more general task of segmenting long texts. Humans can segment book chapters easily, but automatic segmentation is more challenging because the data are semi-structured. Since natural language is prone to ambiguity, it is essential to identify the relationships between the words in each paragraph and to classify consecutive paragraphs based on their relationships with one another. Although researchers have designed deep learning-based models to solve this problem, these approaches have not considered the paragraph-level semantics shared among consecutive paragraphs. In this article, we propose a novel deep learning-based method for segmenting book chapters that uses paragraph-level semantics and an attention mechanism. We first use a pre-trained XLNet model connected to a convolutional neural network (CNN) to extract the semantic meaning of each paragraph. We then measure the similarities between the paragraph-level semantics and design an attention mechanism that injects this similarity information to better predict chapter boundaries. The experimental results indicate that our method surpasses other state-of-the-art (SOTA) methods for chapter segmentation on public datasets: the proposed model achieved an F1 score of 0.8856, outperforming the Bidirectional Encoder Representations from Transformers (BERT) model's F1 score of 0.6640. An ablation study also shows that the paragraph-level attention mechanism yields a significant increase in performance.
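The pipeline the abstract describes — per-paragraph embeddings, similarity measurement between paragraphs, and similarity-driven attention used to score candidate boundaries — can be sketched in a toy form. The paper does not publish this code; the 2-D vectors below merely stand in for XLNet+CNN paragraph embeddings, the window size and scoring rule are illustrative assumptions, and all function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def boundary_scores(embs, window=2):
    """Toy boundary detector: for each paragraph i > 0, build a
    similarity-weighted (attention-style) context vector from up to
    `window` preceding paragraphs, then score the boundary before
    paragraph i as 1 - cosine(context, embs[i]). A high score means the
    paragraph diverges from its attended context, suggesting a chapter
    break. scores[0] is fixed at 0.0 (no preceding context)."""
    n = len(embs)
    scores = [0.0] * n
    for i in range(1, n):
        left = list(range(max(0, i - window), i))
        # Attention weights come from similarity to each left neighbour.
        weights = softmax([cosine(embs[i], embs[j]) for j in left])
        dim = len(embs[i])
        ctx = [sum(weights[k] * embs[j][d] for k, j in enumerate(left))
               for d in range(dim)]
        scores[i] = 1.0 - cosine(ctx, embs[i])
    return scores

# Toy 2-D "embeddings": paragraphs 0-2 share one topic, 3-5 another.
paragraphs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
              [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]
scores = boundary_scores(paragraphs)
print(scores.index(max(scores)))  # -> 3, the topic shift before paragraph 3
```

In the actual model the embeddings, attention weights, and boundary classifier are all learned end to end; this sketch only illustrates why low similarity between a paragraph and its attended context is a useful boundary signal.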