The Impact of Temperature on Extracting Information From Clinical Trial Publications Using Large Language Models.

Authors

Windisch Paul, Dennstädt Fabio, Koechli Carole, Schröder Christina, Aebersold Daniel M, Förster Robert, Zwahlen Daniel R

Affiliations

Department of Radiation Oncology, Cantonal Hospital Winterthur, Winterthur, CHE.

Department of Radiation Oncology, Bern University Hospital, University of Bern, Bern, CHE.

Publication information

Cureus. 2024 Dec 15;16(12):e75748. doi: 10.7759/cureus.75748. eCollection 2024 Dec.

DOI:10.7759/cureus.75748

PMID:39811231

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11731902/

Abstract

Introduction: The application of natural language processing (NLP) for extracting data from biomedical research has gained momentum with the advent of large language models (LLMs). However, the effect of different LLM parameters, such as temperature settings, on biomedical text mining remains underexplored and a consensus on what settings can be considered "safe" is missing. This study evaluates the impact of temperature settings on LLM performance for a named entity recognition and a classification task in clinical trial publications.

Methods: Two datasets were analyzed using GPT-4o and GPT-4o-mini models at nine different temperature settings (0.00-2.00). The models were used to extract the number of randomized participants and classify abstracts as randomized controlled trials (RCTs) and/or as oncology-related. Different performance metrics were calculated for each temperature setting and task.

Results: Both models provided correctly formatted predictions for more than 98.7% of abstracts across temperatures from 0.00 to 1.50. While the number of correctly formatted predictions started to decrease afterward with the most notable drop between temperatures 1.75 and 2.00, the other performance metrics remained largely stable.

Conclusion: Temperature settings at or below 1.50 yielded consistent performance across text-mining tasks, with performance declines at higher settings. These findings are aligned with research on different temperature settings for other tasks, suggesting stable performance within a controlled temperature range across various NLP applications.
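To make the "correctly formatted predictions" measure more concrete, the following is a minimal Python sketch of how such a share could be computed per temperature setting. It is an illustration under stated assumptions, not the authors' pipeline: the even 0.25 spacing of the nine settings, the rule that a valid participant-count prediction is a bare non-negative integer, the helper names `is_well_formatted` and `formatted_share`, and the toy outputs are all hypothetical and not taken from the study.

```python
import re

# Assumption: the nine settings are evenly spaced from 0.00 to 2.00 (step 0.25).
TEMPERATURES = [round(0.25 * i, 2) for i in range(9)]

# Assumption: a correctly formatted participant-count prediction is a bare
# non-negative integer, with surrounding whitespace tolerated.
_INTEGER = re.compile(r"\d+")


def is_well_formatted(prediction: str) -> bool:
    """Return True if the model output parses as a plain integer."""
    return _INTEGER.fullmatch(prediction.strip()) is not None


def formatted_share(predictions: list[str]) -> float:
    """Fraction of predictions that pass the format check."""
    if not predictions:
        raise ValueError("no predictions to score")
    valid = sum(is_well_formatted(p) for p in predictions)
    return valid / len(predictions)


# Hypothetical toy outputs for two settings; not data from the study.
toy_outputs = {
    0.0: ["120", "64", "300", "45"],
    2.0: ["120", "sixty-four!!", "300", ""],
}

for temperature, outputs in toy_outputs.items():
    share = formatted_share(outputs)
    print(f"T={temperature:.2f}: {share:.0%} correctly formatted")
```

In a real evaluation, the toy dictionary would hold one list of model outputs per entry in `TEMPERATURES`, and task metrics such as accuracy would be computed only on the predictions that pass the format check.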
