
Grammar-constrained decoding for structured information extraction with fine-tuned generative models applied to clinical trial abstracts.

Authors

Schmidt David M, Cimiano Philipp

Affiliation

Center for Cognitive Interaction Technology (CITEC), Technical Faculty, Bielefeld University, Bielefeld, Germany.

Publication

Front Artif Intell. 2025 Jan 7;7:1406857. doi: 10.3389/frai.2024.1406857. eCollection 2024.

Abstract

BACKGROUND

In structured information extraction, the output of information extraction (IE) systems is typically subject to semantic and syntactic constraints. These constraints, however, typically cannot be guaranteed by standard (fine-tuned) encoder-decoder architectures. This has led to the development of constrained decoding approaches, which allow constraints to be specified, e.g., in the form of context-free grammars. An open question is to what extent an IE system can be effectively guided by a domain-specific grammar to ensure that the output structures follow the requirements of a given domain data model.
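To make the idea of grammar-constrained decoding concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical toy grammar in which every production starts with a terminal, so the legal next tokens can be read off the top of a parser stack. The grammar, the slot names, and the `toy_scores` function (a stand-in for decoder logits) are all illustrative assumptions; a real system would mask the logits of a subword vocabulary instead.

```python
# Hypothetical toy grammar (not the grammar used in the paper).
# Every production starts with a terminal, which keeps the lookup simple.
GRAMMAR = {
    "TEMPLATE": [["[", "SLOTS"]],
    "SLOTS": [
        ["drug", "=", "VALUE", "SLOTS"],
        ["dose", "=", "VALUE", "SLOTS"],
        ["]"],
    ],
    "VALUE": [["metformin"], ["500mg"]],
}


def allowed_tokens(stack):
    """Return the terminals the grammar permits as the next token."""
    top = stack[-1]
    if top not in GRAMMAR:  # top of stack is a terminal: only it is legal
        return {top}
    return {production[0] for production in GRAMMAR[top]}


def advance(stack, token):
    """Consume `token` and update the parser stack."""
    top = stack.pop()
    if top in GRAMMAR:
        production = next(p for p in GRAMMAR[top] if p[0] == token)
        # Push the rest of the production so its first symbol ends up on top.
        stack.extend(reversed(production[1:]))


def toy_scores(prefix):
    """Stand-in for decoder logits: fixed preferences, penalising repeats."""
    base = {"[": 0.1, "drug": 0.5, "dose": 0.3, "=": 0.2,
            "metformin": 0.9, "500mg": 0.4, "]": 0.05}
    return {tok: s - (1.0 if tok in prefix else 0.0) for tok, s in base.items()}


def constrained_greedy_decode(score_fn, max_steps=20):
    """Greedy decoding restricted to tokens the grammar allows at each step."""
    stack, output = ["TEMPLATE"], []
    while stack and len(output) < max_steps:
        scores = score_fn(output)
        legal = sorted(allowed_tokens(stack))
        token = max(legal, key=lambda t: scores.get(t, float("-inf")))
        advance(stack, token)
        output.append(token)
    return output


print(" ".join(constrained_greedy_decode(toy_scores)))
```

Without the constraint, the highest-scoring first token under `toy_scores` would be `metformin`, which is not a valid start of a template; with the constraint, only `[` is admissible at that step.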

METHODS

In this work, we experimentally investigate the influence of grammar-constrained decoding and pointer generators on the performance of a domain-specific information extraction system. For this purpose, we consider fine-tuned encoder-decoder models, in particular Longformer and Flan-T5, and test whether adding grammar-constrained decoding and pointer generators improves information extraction results. Toward this goal, we consider the task of inducing structured representations from abstracts describing clinical trials, relying on the C-TrO ontology to semantically describe the clinical trials and their results. We frame the task as a slot-filling problem in which certain slots of templates need to be filled with token sequences occurring in the input text. We use a dataset comprising 211 annotated clinical trial abstracts about type 2 diabetes and glaucoma for training and evaluation. Our focus is on settings in which the available training data is on the order of a few hundred training examples, which we consider a low-resource setting.
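As an illustration of what a pointer generator adds on top of a decoder, the sketch below follows the widely used mixture formulation, in which the final token distribution blends the vocabulary distribution with attention mass copied from the source tokens. This is an assumption about the general technique; the abstract does not specify the exact variant or the attention aggregation function used in the paper, and all names and numbers here are hypothetical.

```python
def pointer_generator_distribution(p_vocab, attention, source_tokens, p_gen):
    """Mix generation and copying.

    p_vocab:       dict token -> probability from the decoder's vocabulary softmax
    attention:     attention weights over source positions (sums to 1)
    source_tokens: the input tokens, aligned with `attention`
    p_gen:         probability of generating rather than copying
    """
    final = {tok: p_gen * p for tok, p in p_vocab.items()}
    for tok, weight in zip(source_tokens, attention):
        # Attention on repeated source tokens accumulates on the same entry.
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * weight
    return final


# Hypothetical numbers for illustration only.
dist = pointer_generator_distribution(
    p_vocab={"metformin": 0.2, "placebo": 0.8},
    attention=[0.7, 0.3],
    source_tokens=["timolol", "placebo"],
    p_gen=0.5,
)
print(dist)
```

In this example the source token `timolol` receives probability mass even though the vocabulary distribution assigns it none, which is the property that makes copying attractive for slot filling over input token sequences.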

RESULTS

In all our experiments, grammar-constrained decoding had a positive impact, increasing the score by pp 0.351 (absolute score 0.413) and pp 0.425 (absolute score 0.47) for the best-performing models on the type 2 diabetes and glaucoma datasets, respectively. Adding pointer generators had a detrimental impact on the results, decreasing scores by pp 0.15 (absolute score 0.263) and pp 0.198 (absolute score 0.272) for the best-performing pointer generator models on the type 2 diabetes and glaucoma datasets, respectively.
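The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that each "absolute score" is the score reached after the respective change; this reading is not stated explicitly in the abstract. Under that assumption, subtracting the gain gives the implied score without grammar constraints, and adding the drop back gives the implied reference score for the pointer generator models. The dictionary keys are illustrative names.

```python
# Figures copied from the abstract; the key names are hypothetical.
reported = {
    "type 2 diabetes": {"gcd_gain": 0.351, "gcd_score": 0.413,
                        "ptr_drop": 0.15, "ptr_score": 0.263},
    "glaucoma": {"gcd_gain": 0.425, "gcd_score": 0.47,
                 "ptr_drop": 0.198, "ptr_score": 0.272},
}


def implied_scores(r):
    """Derive the scores implied by the reported changes (assumed reading)."""
    return {
        # Score before adding grammar-constrained decoding.
        "unconstrained": round(r["gcd_score"] - r["gcd_gain"], 3),
        # Score the pointer generator models are compared against.
        "pointer_reference": round(r["ptr_score"] + r["ptr_drop"], 3),
    }


for dataset, r in reported.items():
    print(dataset, implied_scores(r))
```

Under this reading, the pointer reference score coincides with the best grammar-constrained score on each dataset, which suggests the pointer generator drop is measured relative to the best-performing constrained model.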

CONCLUSION

The experimental results indicate that encoder-decoder models used for structure prediction in information extraction tasks clearly benefit, in low-resource settings, from grammar-constrained decoding guiding the output generation. In contrast, the evaluated pointer generator models decreased performance, drastically in some cases. Moreover, the performance of the pointer models appears to depend on both the base model used and the function used for aggregating the attention values. How the size of large language models affects the performance benefit of grammar-constrained decoding remains to be investigated more systematically in future work.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9946/11747381/179ef6149577/frai-07-1406857-g0001.jpg
