London School of Hygiene & Tropical Medicine, London, UK.
Plastic Surgery, Morriston Hospital, Swansea, Wales, UK.
BMJ Open. 2024 Mar 14;14(3):e076484. doi: 10.1136/bmjopen-2023-076484.
To explore whether the large language models (LLMs) Generative Pre-trained Transformer (GPT)-3 and ChatGPT can write clinical letters and predict management plans for common orthopaedic scenarios.
Fifteen scenarios were generated, and ChatGPT and GPT-3 were prompted to write clinical letters and, separately, to generate management plans for identical scenarios with the plans removed.
Letters were assessed for readability using the Readable tool. The accuracy of letters and management plans was assessed by three independent orthopaedic surgery clinicians.
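The abstract does not report how the models were accessed or what prompts were used. Purely as an illustration, the sketch below shows how one such scenario might be submitted to ChatGPT through the OpenAI chat completions API; the model name, system prompt, and scenario text are all assumptions, not the study's actual materials.

```python
# Hypothetical sketch: prompting ChatGPT to draft a clinical letter for one
# orthopaedic scenario. Model name, prompt wording, and scenario text are
# assumptions; the study does not report its exact access method or prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

scenario = (
    "62-year-old woman, 6 weeks post total hip replacement, "
    "wound healed, mobilising with one stick, mild groin discomfort."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are an orthopaedic surgeon writing a clinic letter to a GP."},
        {"role": "user",
         "content": f"Write a clinical letter for this scenario:\n{scenario}"},
    ],
)

print(response.choices[0].message.content)
```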
Both models generated complete letters for all scenarios after a single prompt. Readability was compared using the Flesch-Kincaid Grade Level (ChatGPT: 8.77 (SD 0.918); GPT-3: 8.47 (SD 0.982)), Flesch Reading Ease (ChatGPT: 58.2 (SD 4.00); GPT-3: 59.3 (SD 6.98)), Simple Measure of Gobbledygook (SMOG) Index (ChatGPT: 11.6 (SD 0.755); GPT-3: 11.4 (SD 1.01)), and reach (ChatGPT: 81.2%; GPT-3: 80.3%). ChatGPT produced more accurate letters (8.7/10 (SD 0.60) vs 7.3/10 (SD 1.41), p=0.024) and management plans (7.9/10 (SD 0.63) vs 6.8/10 (SD 1.06), p<0.001) than GPT-3. However, both LLMs sometimes omitted key information or added extra guidance that was, at worst, inaccurate.
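For reference, the three readability indices reported above are closed-form functions of sentence, word, and syllable counts. The minimal sketch below computes them from raw text using a crude vowel-group syllable heuristic; the heuristic is an assumption for illustration, and this is not a reproduction of the Readable tool used in the study.

```python
import re
from math import sqrt

def count_syllables(word: str) -> int:
    """Crude heuristic: count runs of consecutive vowels (assumption, for illustration)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    """Compute the three indices reported in the study from plain text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n = len(words)
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return {
        # Flesch-Kincaid Grade Level: US school grade of the text
        "fkgl": 0.39 * n / sentences + 11.8 * syllables / n - 15.59,
        # Flesch Reading Ease: higher scores indicate easier text
        "fre": 206.835 - 1.015 * n / sentences - 84.6 * syllables / n,
        # SMOG Index (strictly defined for samples of 30+ sentences)
        "smog": 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291,
    }

print(readability("The patient was reviewed in clinic. The wound has healed satisfactorily."))
```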
This study shows that LLMs are effective at generating clinical letters. With little prompting, their output is readable and mostly accurate. However, they are inconsistent and prone to inappropriate omissions or insertions. Furthermore, the management plans produced by the LLMs were generic but often accurate. In the future, a healthcare-specific language model trained on accurate and secure data could provide an excellent tool for increasing clinicians' efficiency by summarising large volumes of data into a single clinical letter.