Chen Qijie, Sun Haotong, Liu Haoyang, Jiang Yinghui, Ran Ting, Jin Xurui, Xiao Xianglu, Lin Zhimin, Chen Hongming, Niu Zhangmin
AIDD, Mindrank AI Ltd, Zhejiang 310000, China.
College of Life Sciences, Nankai University, Tianjin 300071, China.
Bioinformatics. 2023 Sep 2;39(9). doi: 10.1093/bioinformatics/btad557.
In recent years, advances in natural language processing (NLP) technologies and deep learning hardware have led to significant improvements in large language models (LLMs). ChatGPT, a state-of-the-art LLM built on GPT-3.5 and GPT-4, shows excellent capabilities in general language understanding and reasoning. Researchers have also tested the GPT models on a variety of NLP tasks and benchmarks, with excellent results. Given this exciting performance in everyday conversation, researchers have begun to explore ChatGPT's capacity in areas of expertise that require professional education for humans; here, we focus on the biomedical domain.
To evaluate the performance of ChatGPT on biomedical tasks, this article presents a comprehensive benchmark study of ChatGPT on biomedical corpora, including article abstracts, clinical trial descriptions, and biomedical questions. Typical NLP tasks such as named entity recognition, relation extraction, sentence similarity, question answering, and document classification are included. Overall, ChatGPT achieved a BLURB score of 58.50, while the state-of-the-art model scored 84.30. Through a series of experiments, we demonstrate the effectiveness and versatility of ChatGPT in biomedical text understanding, reasoning, and generation, as well as the limitations of the GPT-3.5-based ChatGPT.
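To illustrate the kind of scoring underlying such a benchmark, the sketch below computes entity-level micro-averaged F1, a standard metric for named entity recognition tasks like those in BLURB. This is a minimal illustrative helper, not the authors' actual evaluation code; the function name and data layout are assumptions.

```python
# Hypothetical sketch: entity-level micro-F1 for NER evaluation.
# `gold` and `pred` are parallel lists, one entity set per document.
# This mirrors the metric style of NER benchmarks, not the paper's exact code.

def entity_f1(gold, pred):
    """Return micro-averaged (precision, recall, F1) over entity sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # entities found in both gold and prediction
        fp += len(p - g)   # predicted entities absent from gold
        fn += len(g - p)   # gold entities the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, a model that recovers one of two gold entities with no spurious predictions gets precision 1.0, recall 0.5, and F1 of 2/3; aggregating such scores across tasks is how a composite benchmark score like BLURB's is formed.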
All the datasets are available from the BLURB benchmark (https://microsoft.github.io/BLURB/index.html). The prompts are described in the article.