Shaib Chantal, Li Millicent L, Joseph Sebastian, Marshall Iain J, Li Junyi Jessy, Wallace Byron C
Northeastern University.
The University of Texas at Austin.
Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:1387-1407. doi: 10.18653/v1/2023.acl-short.119.
Large language models, particularly GPT-3, are able to produce high quality summaries of general domain news articles in few- and zero-shot settings. However, it is unclear if such models are similarly capable in more specialized, high-stakes domains such as biomedicine. In this paper, we enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generated by GPT-3, given zero supervision. We consider both single- and multi-document settings. In the former, GPT-3 is tasked with generating regular and plain-language summaries of articles describing randomized controlled trials; in the latter, we assess the degree to which GPT-3 is able to synthesize evidence reported across a collection of articles. We design an annotation scheme for evaluating model outputs, with an emphasis on assessing the factual accuracy of generated summaries. We find that while GPT-3 is able to summarize and simplify single biomedical articles faithfully, it struggles to provide accurate aggregations of findings over multiple documents. We release all data and annotations used in this work.