Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer.

Affiliations

Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.

Publication information

Neural Netw. 2024 Aug;176:106322. doi: 10.1016/j.neunet.2024.106322. Epub 2024 Apr 16.

Abstract

In long document classification (LDC), previous research has focused mainly on modeling unimodal text, overlooking the potential of multi-modal documents that include images. To address this gap, we introduce an approach to multi-modal long document classification based on the Hierarchical Prompt and Multi-modal Transformer (HPMT). HPMT performs multi-modal interactions at both the section and sentence levels, capturing the hierarchical structural features and complex multi-modal associations of long documents. Specifically, a Multi-scale Multi-modal Transformer (MsMMT) captures multi-granularity correlations between sentences and images by applying multi-scale convolutional kernels to sentence features. Furthermore, to facilitate cross-level information interaction and promote the learning of level-specific features, we introduce a Hierarchical Prompt (HierPrompt) block. This block contains section-level prompts and sentence-level prompts, both derived from a global prompt via distinct projection networks. Extensive experiments on four challenging multi-modal long document datasets show that the proposed method outperforms existing techniques.
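The two mechanisms named in the abstract, multi-scale convolution over sentence features and prompts derived from a shared global prompt, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, dimensions, kernel sizes (1, 2, 3), zero padding, random weights standing in for learned parameters, and the use of a single linear map as each "projection network" are all assumptions made for illustration.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weight matrix; stands in for a learned layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def multi_scale_conv(sentences, kernels):
    """Apply one 1-D convolution per kernel size along the sentence axis.

    sentences: list of sentence feature vectors (n_sent x dim).
    kernels:   dict mapping kernel size k to weights of shape
               (out_dim x k*dim), i.e. a kernel spanning k sentences.
    Returns a dict mapping k to an (n_sent x out_dim) feature list.
    Zero padding keeps the output length equal to n_sent (an assumption).
    """
    dim = len(sentences[0])
    zero = [0.0] * dim
    outputs = {}
    for k, weights in kernels.items():
        left = (k - 1) // 2
        right = k - 1 - left
        padded = [zero] * left + sentences + [zero] * right
        outputs[k] = []
        for i in range(len(sentences)):
            # Flatten the k neighbouring sentence vectors into one window.
            window = [v for vec in padded[i:i + k] for v in vec]
            outputs[k].append(apply_linear(weights, window))
    return outputs


def derive_prompts(global_prompt, section_proj, sentence_proj):
    """Derive section-level and sentence-level prompts from one global prompt.

    Each level uses its own projection (here a single linear map, which is
    an assumption; the abstract only says "distinct projection networks").
    """
    section = [apply_linear(section_proj, p) for p in global_prompt]
    sentence = [apply_linear(sentence_proj, p) for p in global_prompt]
    return section, sentence


# Hypothetical sizes chosen only for the demonstration.
rng = random.Random(0)
DIM, N_SENT, N_PROMPT = 8, 5, 2

sentences = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_SENT)]
kernels = {k: make_linear(k * DIM, DIM, rng) for k in (1, 2, 3)}
features = multi_scale_conv(sentences, kernels)

global_prompt = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PROMPT)]
section_prompts, sentence_prompts = derive_prompts(
    global_prompt,
    make_linear(DIM, DIM, rng),
    make_linear(DIM, DIM, rng),
)
```

Each entry of `features` holds one view of the sentence sequence at a different granularity (single sentences, pairs, triples). How the paper fuses these views with image features inside the transformer is not stated in the abstract, so the sketch stops before that step.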