
GPT-4 as an X data annotator: Unraveling its performance on a stance classification task.

Affiliations

Department of Computer Science, Lakehead University, Thunder Bay, Ontario, Canada.

Department of Social Work, Lakehead University, Thunder Bay, Ontario, Canada.

Publication information

PLoS One. 2024 Aug 15;19(8):e0307741. doi: 10.1371/journal.pone.0307741. eCollection 2024.

DOI:10.1371/journal.pone.0307741
PMID:39146280
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11326574/
Abstract

Data annotation in NLP is a costly and time-consuming task, traditionally handled by human experts who require extensive training to enhance the task-related background knowledge. Besides, labeling social media texts is particularly challenging due to their brevity, informality, creativity, and varying human perceptions regarding the sociocultural context of the world. With the emergence of GPT models and their proficiency in various NLP tasks, this study aims to establish a performance baseline for GPT-4 as a social media text annotator. To achieve this, we employ our own dataset of tweets, expertly labeled for stance detection with full inter-rater agreement among three annotators. We experiment with three techniques: Zero-shot, Few-shot, and Zero-shot with Chain-of-Thoughts to create prompts for the labeling task. We utilize four training sets constructed with different label sets, including human labels, to fine-tune transformer-based large language models and various combinations of traditional machine learning models with embeddings for stance classification. Finally, all fine-tuned models undergo evaluation using a common testing set with human-generated labels. We use the results from models trained on human labels as the benchmark to assess GPT-4's potential as an annotator across the three prompting techniques. Based on the experimental findings, GPT-4 achieves comparable results through the Few-shot and Zero-shot Chain-of-Thoughts prompting methods. However, none of these labeling techniques surpass the top three models fine-tuned on human labels. Moreover, we introduce the Zero-shot Chain-of-Thoughts as an effective strategy for aspect-based social media text labeling, which performs better than the standard Zero-shot and yields results similar to the high-performing yet expensive Few-shot approach.
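The three prompting techniques named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the authors' prompt wording, stance labels, or target topic, so the label set `LABELS`, the function names, and all prompt text below are assumptions made for illustration only. The sketch only builds prompt strings; sending them to a model is left out.

```python
from typing import Sequence, Tuple

# Hypothetical label set: the abstract does not list the stance labels used.
LABELS = ("favor", "against", "neutral")


def zero_shot_prompt(tweet: str, target: str) -> str:
    """Zero-shot: ask for a label directly, with no examples."""
    return (
        f"Classify the stance of the tweet toward '{target}'.\n"
        f"Answer with one of: {', '.join(LABELS)}.\n"
        f"Tweet: {tweet}\n"
        "Stance:"
    )


def few_shot_prompt(
    tweet: str, target: str, examples: Sequence[Tuple[str, str]]
) -> str:
    """Few-shot: prepend labeled demonstrations before the tweet to label."""
    demos = "\n".join(f"Tweet: {t}\nStance: {s}" for t, s in examples)
    return (
        f"Classify the stance of each tweet toward '{target}'.\n"
        f"Answer with one of: {', '.join(LABELS)}.\n"
        f"{demos}\n"
        f"Tweet: {tweet}\n"
        "Stance:"
    )


def zero_shot_cot_prompt(tweet: str, target: str) -> str:
    """Zero-shot chain-of-thought: no examples, but request reasoning first."""
    return (
        f"Classify the stance of the tweet toward '{target}'.\n"
        f"Tweet: {tweet}\n"
        "Let's think step by step, then finish with a line of the form "
        f"'Stance: <label>' where <label> is one of: {', '.join(LABELS)}."
    )
```

The few-shot prompt grows with every demonstration it carries, which is consistent with the abstract's description of that approach as expensive, while the chain-of-thought variant adds only a fixed reasoning instruction.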

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/654a/11326574/89f3dad3b37a/pone.0307741.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/654a/11326574/e9fe0d1fa156/pone.0307741.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/654a/11326574/4a3222b7c6f1/pone.0307741.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/654a/11326574/cff3cd8a2550/pone.0307741.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/654a/11326574/3f21f67939bd/pone.0307741.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/654a/11326574/81da7924dbc8/pone.0307741.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/654a/11326574/6e1980893fb7/pone.0307741.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/654a/11326574/b18f63566148/pone.0307741.g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/654a/11326574/9f870d9969e1/pone.0307741.g009.jpg
