Li Fengnan, Hill Elliot D, Jiang Shu, Gao Jiaxin, Engelhard Matthew M
Duke University / Durham, NC, USA.
Proc Conf Assoc Comput Linguist Meet. 2025 Jul;2025:30263-30283.
Transformer-based models achieve state-of-the-art performance in document classification but struggle with long texts because the computational cost of self-attention grows quadratically with sequence length. Existing solutions, such as sparse attention, hierarchical models, and key sentence extraction, partially address the issue but still fall short when the input sequence is exceptionally long. To address this challenge, we propose IRIS (Interpretable Retrieval-Augmented Classification for long Interspersed Document Sequences), a novel, lightweight framework that uses retrieval to classify long documents efficiently while enhancing interpretability. IRIS segments documents into chunks, stores their embeddings in a vector database, and retrieves the chunks most relevant to a given task using learnable query vectors. A linear attention mechanism then aggregates the retrieved embeddings for classification, so the model can process arbitrarily long documents without increasing computational cost while remaining trainable on a single GPU. Our experiments across six datasets show that IRIS performs comparably to baseline models on standard benchmarks and excels in three clinical-note disease risk prediction tasks, where documents are extremely long and key information is sparse. Furthermore, IRIS provides global interpretability by producing a clear summary of the key risk factors identified by the model. These findings highlight the potential of IRIS as an efficient and interpretable solution for long-document classification, particularly in healthcare applications where both performance and explainability are crucial.
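The retrieve-then-aggregate pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (chunk_tokens, retrieve_top_k, attention_pool, classify) are hypothetical, random vectors stand in for chunk embeddings from a real encoder, a NumPy array stands in for the vector database, and a softmax-weighted pooling over the retrieved chunks is assumed in place of the paper's linear attention, whose exact form the abstract does not give. No training is shown; in practice the query vectors and output weights would be learned.

```python
import numpy as np

def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks (the last may be shorter)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def retrieve_top_k(chunk_embs, queries, k):
    """For each query, return indices of the k chunks with the highest dot-product score."""
    scores = queries @ chunk_embs.T               # (n_queries, n_chunks)
    k = min(k, chunk_embs.shape[0])
    top = np.argsort(-scores, axis=1)[:, :k]      # (n_queries, k), best first
    return top, scores

def attention_pool(retrieved, query):
    """Softmax-weighted average of the retrieved embeddings (assumed aggregation).

    The cost depends only on k, the number of retrieved chunks,
    not on the length of the original document.
    """
    logits = retrieved @ query                    # (k,)
    weights = np.exp(logits - logits.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ retrieved, weights           # (dim,), (k,)

def classify(chunk_embs, queries, w_out, b_out, k=4):
    """Retrieve per query, pool, concatenate, and apply a logistic output layer."""
    top, _ = retrieve_top_k(chunk_embs, queries, k)
    pooled = [attention_pool(chunk_embs[idx], q)[0] for idx, q in zip(top, queries)]
    features = np.concatenate(pooled)             # (n_queries * dim,)
    logit = features @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logit)), top

# Toy run: 1,000 placeholder chunk embeddings, 3 placeholder query vectors.
rng = np.random.default_rng(0)
dim, n_queries = 16, 3
chunk_embs = rng.normal(size=(1000, dim))         # stand-in for encoder outputs
queries = rng.normal(size=(n_queries, dim))       # would be learned in practice
w_out = rng.normal(size=n_queries * dim) * 0.1    # would be learned in practice
prob, top = classify(chunk_embs, queries, w_out, 0.0, k=4)
print("predicted probability:", prob)
print("retrieved chunk indices per query:", top.tolist())
```

Because each query attends only to its k retrieved chunks, the classification step costs the same for a 1,000-chunk document as for a 10-chunk one. The indices in top point back to specific chunks, which is the hook for the kind of interpretability the abstract describes.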