Li Wei, Liu Yingzhen, Yang Yinling, Zhang Ting, Men Wei
National Defense University, Beijing, China.
State Key Laboratory of Geo-Information Engineering, Beijing, China.
PLoS One. 2025 Jun 27;20(6):e0326764. doi: 10.1371/journal.pone.0326764. eCollection 2025.
Large language models (LLMs) have demonstrated remarkable performance across various linguistic tasks. However, existing LLMs perform inadequately on information extraction tasks in both Chinese and English. Numerous studies attempt to enhance model performance by scaling up training data, yet discrepancies in the number and type of schemas used during training and evaluation can still harm model effectiveness. To tackle this challenge, we propose ChunkUIE, a unified information extraction model that supports Chinese and English. We design a chunked instruction construction strategy that randomly and reproducibly divides all schemas into chunks containing an identical number of schemas, ensuring that the union of schemas across all chunks covers the full schema set. By limiting the number of schemas in each instruction, this strategy effectively addresses the performance degradation caused by inconsistencies in schema counts between training and evaluation. Additionally, we construct challenging negative schemas from a predefined hard schema dictionary, mitigating the model's semantic confusion between similar schemas. Experimental results demonstrate that ChunkUIE enhances zero-shot performance in information extraction.
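The abstract describes the chunked instruction construction and hard-negative sampling only at a high level; the Python sketch below is one plausible reading of that description, not the authors' implementation. The function names (chunk_schemas, add_hard_negatives), the hard_schema_dict mapping, and all parameter values are hypothetical.

```python
import random

def chunk_schemas(schemas, chunk_size, seed=42):
    """Randomly but reproducibly partition schemas into equal-sized chunks.

    A fixed seed makes the split reproducible; the final chunk is padded
    with already-used schemas so every chunk has the same size, while the
    union of all chunks still covers the full schema set.
    """
    rng = random.Random(seed)            # seeded RNG -> reproducible shuffle
    shuffled = schemas[:]
    rng.shuffle(shuffled)
    chunks = [shuffled[i:i + chunk_size]
              for i in range(0, len(shuffled), chunk_size)]
    last = chunks[-1]
    if len(last) < chunk_size:           # pad the short final chunk
        pad = [s for s in shuffled if s not in last]
        last.extend(pad[:chunk_size - len(last)])
    return chunks

def add_hard_negatives(chunk, hard_schema_dict, num_negatives=2, seed=42):
    """Augment a chunk with confusable schemas from a hard-schema dictionary.

    Negative schemas are types that are semantically close to those in the
    chunk but absent from it, forcing the model to discriminate between
    similar schemas rather than extract for every plausible type.
    """
    rng = random.Random(seed)
    candidates = []
    for schema in chunk:
        candidates.extend(n for n in hard_schema_dict.get(schema, [])
                          if n not in chunk)
    rng.shuffle(candidates)
    return chunk + candidates[:num_negatives]

if __name__ == "__main__":
    schemas = ["person", "organization", "location", "date", "event"]
    hard_dict = {"organization": ["institution"], "location": ["facility"]}
    for chunk in chunk_schemas(schemas, chunk_size=2):
        print(add_hard_negatives(chunk, hard_dict))
```

Under this reading, capping each instruction at chunk_size schemas keeps the schema count seen at training time consistent with evaluation, and the seeded shuffle makes the chunk assignment deterministic across runs.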