Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States.
Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA 94158, United States.
J Am Med Inform Assoc. 2024 Oct 1;31(10):2315-2327. doi: 10.1093/jamia/ocae146.
Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations.
We curated a dataset of 769 breast cancer pathology reports, manually labeled with 12 categories, to compare the zero-shot classification capability of four LLMs (GPT-4, GPT-3.5, Starling, and ClinicalCamel) against the task-specific supervised classification performance of three models: random forests, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model.
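Zero-shot classification here means the LLM receives only a task description and the allowed label set, with no labeled examples. The study's actual prompts are not reproduced in the abstract; the sketch below is a hypothetical illustration of the pattern, with the task name and label values chosen as assumptions.

```python
def zero_shot_prompt(report_text: str, task_name: str, labels: list[str]) -> str:
    """Build a minimal zero-shot classification prompt for a pathology
    report: task description plus allowed labels, no labeled examples.
    The wording is illustrative, not the study's actual prompt."""
    label_list = ", ".join(labels)
    return (
        "You are extracting structured data from a breast cancer "
        "pathology report.\n"
        f"Task: classify the report's {task_name}.\n"
        f"Allowed answers: {label_list}.\n"
        f"Report:\n{report_text}\n"
        "Answer with exactly one of the allowed answers."
    )

# Hypothetical usage: "margin status" and its labels are assumed examples,
# not categories confirmed by the abstract.
prompt = zero_shot_prompt(
    "Invasive ductal carcinoma, margins negative for tumor.",
    "margin status",
    ["positive", "negative", "not reported"],
)
```

The returned string would be sent to the LLM as-is; constraining the answer to a fixed label set makes the free-text response easy to map back onto the supervised models' label space for a head-to-head comparison.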
Across all 12 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, LSTM-Att (average macro F1-score of 0.86 vs 0.75), with the largest advantage on tasks with high label imbalance. The other LLMs performed poorly. Frequent GPT-4 error categories included incorrect inferences from multiple samples and from patient history, as well as complex task design; several LSTM-Att errors reflected poor generalization to the test set.
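Macro F1 is the natural metric for the imbalanced tasks mentioned above because it averages per-class F1 with equal weight, so performance on rare labels counts as much as on common ones. A minimal stdlib-only sketch (the example labels below are illustrative, not the study's data):

```python
def macro_f1(y_true: list, y_pred: list) -> float:
    """Per-class F1 averaged with equal weight per class, so minority
    classes contribute as much as majority ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Illustrative imbalanced example: the rare "pos" class drags the macro
# average down even though 3 of 5 predictions are correct.
score = macro_f1(
    ["pos", "pos", "neg", "neg", "neg"],
    ["pos", "neg", "neg", "neg", "pos"],
)  # (0.5 + 2/3) / 2 = 7/12
```

A plain accuracy or micro-averaged F1 on the same example would look healthier, which is exactly why a model's edge on imbalanced labels shows up most clearly under macro averaging.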
On tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of data labeling. However, where the use of LLMs is prohibitive, simpler models trained on large annotated datasets can provide comparable results.
GPT-4 demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in clinical studies.