
Enhancing data quality in medical concept normalization through large language models.

Author Information

Chen Haihua, Li Ruochi, Cleveland Ana, Ding Junhua

Affiliations

The Anuradha & Vikas Sinha Department of Data Science, University of North Texas, Denton, TX 76203, USA.

Department of Computer Science, North Carolina State University, Raleigh, NC 27695, USA.

Publication Information

J Biomed Inform. 2025 May;165:104812. doi: 10.1016/j.jbi.2025.104812. Epub 2025 Apr 1.

Abstract

OBJECTIVE

Medical concept normalization (MCN) aims to map informal medical terms to formal medical concepts, a critical task in building machine learning systems for medical applications. However, most existing studies on MCN primarily focus on models and algorithms, often overlooking the vital role of data quality. This research evaluates MCN performance across varying data quality scenarios and investigates how to leverage these evaluation results to enhance data quality, ultimately improving MCN performance through the use of large language models (LLMs). The effectiveness of the proposed approach is demonstrated through a case study.

METHODS

We begin by conducting a data quality evaluation of a dataset used for MCN. Based on these findings, we employ ChatGPT-based zero-shot prompting for data augmentation. The quality of the generated data is then assessed across the dimensions of correctness and comprehensiveness. A series of experiments is performed to analyze the impact of data quality on MCN model performance. These results guide us in implementing LLM-based few-shot prompting to further enhance data quality and improve model performance.
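The two prompting strategies above can be sketched as prompt builders. This is an illustrative sketch only: the function names, prompt wording, and example term pairs are hypothetical and are not the paper's actual prompts.

```python
# Hypothetical sketch of zero-shot vs. few-shot prompting for MCN data
# augmentation. Prompt wording and examples are illustrative assumptions.

def zero_shot_prompt(concept: str, n: int = 5) -> str:
    """Ask an LLM to invent informal variants of a formal medical
    concept, with no examples drawn from the original dataset."""
    return (
        f"Generate {n} informal phrases a patient might use for the "
        f"medical concept '{concept}'. Return one phrase per line."
    )

def few_shot_prompt(concept: str,
                    examples: list[tuple[str, str]],
                    n: int = 5) -> str:
    """Same task, but the prompt carries a small, representative set of
    (informal term, formal concept) pairs from the original data."""
    shots = "\n".join(f"'{term}' -> '{c}'" for term, c in examples)
    return (
        "Informal medical terms map to formal concepts, for example:\n"
        f"{shots}\n"
        f"Following this style, generate {n} new informal phrases for "
        f"'{concept}'. Return one phrase per line."
    )

print(zero_shot_prompt("Myocardial infarction"))
print(few_shot_prompt(
    "Myocardial infarction",
    [("heart attack", "Myocardial infarction"),
     ("tummy ache", "Abdominal pain")],
))
```

In the few-shot variant, the in-context pairs anchor the model to the dataset's style and vocabulary, which is the mechanism the CONCLUSION credits for higher-quality augmentation.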

RESULTS

Duplication of data items within a dataset can lead to inaccurate evaluation results. Data augmentation techniques such as zero-shot and few-shot learning with ChatGPT can introduce duplicated data items, particularly those in the mean region of a dataset's distribution. As such, data augmentation strategies must be carefully designed, incorporating context information and training data to avoid these issues. Additionally, we found that including augmented data in the testing set is necessary to fairly evaluate the effectiveness of data augmentation strategies.
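The duplication issue noted above can be guarded against by filtering augmented items that collapse to the same surface form as existing data. A minimal sketch follows; the normalization heuristic (lowercasing, stripping punctuation, collapsing whitespace) is an assumption, not the paper's exact method.

```python
# Illustrative duplicate filter for LLM-augmented MCN data: augmented
# terms that normalize to an existing surface form would inflate
# evaluation scores, so they are dropped before use.
import re

def normalize(term: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    trivially-varied duplicates compare equal."""
    term = term.lower().strip()
    term = re.sub(r"[^\w\s]", "", term)
    return re.sub(r"\s+", " ", term)

def filter_duplicates(augmented: list[str], existing: list[str]) -> list[str]:
    """Keep only augmented terms not already present (after
    normalization) in the original data or earlier in the batch."""
    seen = {normalize(t) for t in existing}
    kept = []
    for t in augmented:
        key = normalize(t)
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept

existing = ["heart attack", "chest pain"]
augmented = ["Heart attack!", "cardiac arrest", "chest  pain", "cardiac arrest"]
print(filter_duplicates(augmented, existing))  # ['cardiac arrest']
```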

CONCLUSION

While LLMs can generate high-quality data for MCN, the success of data augmentation depends heavily on the strategy employed. Our study found that few-shot learning, with prompts that incorporate appropriate context and a small, representative set of original data, is an effective approach. The methods developed in this research, including the data quality evaluation framework, LLM-based data augmentation strategies, and procedures for data quality enhancement, provide valuable insights for data augmentation and evaluation in similar deep learning applications.

AVAILABILITY

https://github.com/RichardLRC/mcn-data-quality-llm/tree/main/evaluation

