Zhu Kexin, Zhang Jiajie, Klishin Anton, Esser Mario, Blumentals William A, Juhaeri Juhaeri, Jouquelet-Royer Corinne, Sinnott Sarah-Jo
Epidemiology and Benefit Risk, Sanofi, Bridgewater, New Jersey, USA.
Babraham Research Campus, Sanofi, Cambridge, UK.
Pharmacoepidemiol Drug Saf. 2025 Feb;34(2):e70111. doi: 10.1002/pds.70111.
Accurate background epidemiology of diseases is required in pharmacoepidemiologic research. We evaluated the performance of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and Google Bard, when prompted with questions on disease frequency.
A total of 21 questions on the prevalence and incidence of common and rare diseases were developed and submitted to each LLM twice on different dates. Benchmark data were obtained from literature searches targeting "gold-standard" references (e.g., government statistics, peer-reviewed articles). Accuracy was evaluated by comparing LLMs' responses to the benchmark data. Consistency was determined by comparing the responses to the same query submitted on different dates. The relevance and authenticity of references were evaluated.
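The accuracy and consistency calculations described above can be sketched as follows. This is a minimal illustration only; the function names, data shapes, and the relative tolerance used to judge a match against the benchmark are assumptions, not details reported by the study.

```python
# Illustrative sketch (not the study's actual code): score LLM responses
# against benchmark epidemiology values, and compare two submission dates.

def accuracy(responses, benchmarks, tol=0.10):
    """Fraction of numeric responses within a relative tolerance of the
    benchmark value. The 10% tolerance is an assumed threshold."""
    hits = sum(
        1 for r, b in zip(responses, benchmarks)
        if b != 0 and abs(r - b) / b <= tol
    )
    return hits / len(responses)

def consistency(first_run, second_run):
    """Fraction of queries answered identically on the two dates."""
    same = sum(1 for a, b in zip(first_run, second_run) if a == b)
    return same / len(first_run)
```

For example, `accuracy([10.0, 5.0], [10.0, 8.0])` returns 0.5 (the first response matches its benchmark exactly; the second is off by more than 10%), and `consistency(["1 in 2000", "5%"], ["1 in 2000", "7%"])` returns 0.5.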
The three LLMs generated 126 responses in total. ChatGPT-4 was accurate in 76.2% of responses, higher than Bard (50.0%) and ChatGPT-3.5 (45.2%). ChatGPT-4 also exhibited higher consistency (71.4%) than Bard (57.9%) or ChatGPT-3.5 (46.7%). ChatGPT-4 provided 52 references, of which 27 (51.9%) provided relevant information and all were authentic. Only 9.2% (10/109) of references from Bard were relevant. Of Bard's 109 references, 65 were unique; among these, 67.7% were authentic, 7.7% provided insufficient information for access, 10.8% had inaccurate citations, and 13.8% were non-existent/fabricated. ChatGPT-3.5 did not provide any references.
ChatGPT-4 outperformed Bard and ChatGPT-3.5 in retrieving information on disease epidemiology. However, all three LLMs produced inaccurate responses, including irrelevant, incomplete, or fabricated references. These limitations preclude the use of LLMs in their current forms for obtaining accurate disease epidemiology by researchers in the pharmaceutical industry, in academia, or in regulatory settings.