ChunkUIE: Chunked instruction-based unified information extraction.

Authors

Li Wei, Liu Yingzhen, Yang Yinling, Zhang Ting, Men Wei

Affiliations

National Defense University, Beijing, China.

State Key Laboratory of Geo-Information Engineering, Beijing, China.

Publication information

PLoS One. 2025 Jun 27;20(6):e0326764. doi: 10.1371/journal.pone.0326764. eCollection 2025.

DOI:10.1371/journal.pone.0326764

PMID:40577353

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12204470/

Abstract

Large language models (LLMs) have demonstrated remarkable performance across various linguistic tasks. However, existing LLMs perform inadequately in information extraction tasks for both Chinese and English. Numerous studies attempt to enhance model performance by increasing the scale of training data. However, discrepancies in the number and type of schemas used during training and evaluation can harm model effectiveness. To tackle this challenge, we propose ChunkUIE, a unified information extraction model that supports Chinese and English. We design a chunked instruction construction strategy that randomly and reproducibly divides all schemas into chunks containing an identical number of schemas. This approach ensures that the union of schemas across all chunks encompasses all schemas. By limiting the number of schemas in each instruction, this strategy effectively addresses the performance degradation caused by inconsistencies in schema counts between training and evaluation. Additionally, we construct some challenging negative schemas using a predefined hard schema dictionary, which mitigates the model's semantic confusion regarding similar schemas. Experimental results demonstrate that ChunkUIE enhances zero-shot performance in information extraction.
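
The chunking strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`chunk_schemas`, `sample_hard_negatives`), the `seed` parameter, the padding of the last chunk with already-used schemas, and the shape of the hard schema dictionary are all assumptions made for illustration, since the abstract does not specify these details.

```python
import random


def chunk_schemas(schemas, chunk_size, seed=0):
    """Split schemas into chunks of identical size, reproducibly.

    Assumption: when the schema count is not divisible by chunk_size,
    the last chunk is padded with schemas drawn from the other chunks.
    The abstract does not say how the paper handles this case.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    rng = random.Random(seed)            # fixed seed makes the split reproducible
    shuffled = sorted(set(schemas))      # canonical order, so only the seed matters
    rng.shuffle(shuffled)
    chunks = [shuffled[i:i + chunk_size]
              for i in range(0, len(shuffled), chunk_size)]
    last = chunks[-1] if chunks else []
    if len(chunks) > 1 and len(last) < chunk_size:
        pool = [s for s in shuffled if s not in last]
        last.extend(rng.sample(pool, chunk_size - len(last)))
    return chunks


def sample_hard_negatives(gold_schemas, hard_dict, max_negatives, seed=0):
    """Pick negative schemas that are easily confused with the gold ones.

    Assumption: hard_dict maps a schema name to a list of semantically
    similar schema names (a hypothetical format for the hard schema dictionary).
    """
    gold = set(gold_schemas)
    rng = random.Random(seed)
    candidates = sorted({neg
                         for schema in gold
                         for neg in hard_dict.get(schema, [])
                         if neg not in gold})
    rng.shuffle(candidates)
    return candidates[:max_negatives]


if __name__ == "__main__":
    all_schemas = ["person", "location", "organization", "date", "event"]
    for chunk in chunk_schemas(all_schemas, chunk_size=2, seed=42):
        print(chunk)  # each chunk would back one instruction
```

In this sketch every chunk holds exactly `chunk_size` schemas whenever at least one full chunk exists, and the union of all chunks equals the full schema set, which matches the two properties the abstract states.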
