


The plausibility machine commonsense (PMC) dataset: A massively crowdsourced human-annotated dataset for studying plausibility in large language models.

Authors

Nananukul Navapat, Shen Ke, Kejriwal Mayank

Affiliation

University of Southern California, 4676 Admiralty Way, Suite 1001 Marina del Rey, CA 90292, USA.

Publication

Data Brief. 2024 Aug 24;57:110869. doi: 10.1016/j.dib.2024.110869. eCollection 2024 Dec.

DOI: 10.1016/j.dib.2024.110869
PMID: 39296626
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11408755/
Abstract

Commonsense reasoning has emerged as a challenging problem in Artificial Intelligence (AI). However, one area of commonsense reasoning that has not received nearly as much attention in the AI research community is plausibility, which focuses on determining the likelihood of commonsense statements. Human-annotated benchmarks are essential for advancing research in this nascent area, as they enable researchers to develop and evaluate AI models effectively. Because plausibility is a subjective concept, it is important to obtain nuanced annotations rather than a binary label of 'plausible' or 'implausible'. It is also important to obtain multiple human annotations per statement to ensure the validity of the labels. In this data article, we describe the process of re-annotating an existing commonsense plausibility dataset (SemEval-2020 Task 4) using large-scale crowdsourcing on the Amazon Mechanical Turk platform. We obtained 10,000 unique annotations on a corpus of 2,000 sentences (five independent annotations per sentence), and each sentence was assigned an aggregate label based on these annotations. Next, we prompted the GPT-3.5 and GPT-4 models developed by OpenAI: sentences from the human-annotated files were fed into the models using custom prompt templates, and the models' generated labels were compared against those produced by humans to determine whether they were aligned. The PMC-Dataset is meant to serve as a rich resource for analysing and comparing human and machine commonsense reasoning capabilities, specifically on plausibility. Researchers can use this dataset to train, fine-tune, and evaluate AI models on plausibility. Applications include determining the likelihood of everyday events, assessing the realism of hypothetical scenarios, and distinguishing between plausible and implausible statements in commonsense text. Ultimately, we intend for the dataset to support ongoing AI research by offering a robust foundation for developing models that are better aligned with human commonsense reasoning.
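The pipeline the abstract describes (five independent plausibility ratings per sentence, reduced to a single label and then compared against model-generated labels) can be sketched as follows. The function names and the majority-vote aggregation rule are illustrative assumptions; the abstract does not specify the dataset's actual aggregation scheme:

```python
from collections import Counter

def aggregate_annotations(ratings):
    """Reduce independent per-sentence ratings to one label by majority vote.

    Majority vote is an assumption for illustration; the paper's actual
    aggregation scheme is not given in the abstract. Ties resolve in
    favour of the rating encountered first.
    """
    label, _count = Counter(ratings).most_common(1)[0]
    return label

def human_model_agreement(human_labels, model_labels):
    """Fraction of sentences where the model's label matches the human label."""
    if len(human_labels) != len(model_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)

# Example: one sentence with five crowd ratings, as in the PMC dataset.
crowd = ["plausible", "plausible", "implausible", "plausible", "implausible"]
gold = aggregate_annotations(crowd)  # majority label across annotators
```

The same agreement function, applied over all 2,000 sentences, would give the kind of human–model alignment score the authors compute for GPT-3.5 and GPT-4.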


Similar Articles

1. The plausibility machine commonsense (PMC) dataset: A massively crowdsourced human-annotated dataset for studying plausibility in large language models.
   Data Brief. 2024 Aug 24;57:110869. doi: 10.1016/j.dib.2024.110869. eCollection 2024 Dec.
2. TG-CSR: A human-labeled dataset grounded in nine formal commonsense categories.
   Data Brief. 2023 Oct 11;51:109666. doi: 10.1016/j.dib.2023.109666. eCollection 2023 Dec.
3. A noise audit of human-labeled benchmarks for machine commonsense reasoning.
   Sci Rep. 2024 Apr 14;14(1):8609. doi: 10.1038/s41598-024-58937-4.
4. CRIC: A VQA Dataset for Compositional Reasoning on Vision and Commonsense.
   IEEE Trans Pattern Anal Mach Intell. 2023 May;45(5):5561-5578. doi: 10.1109/TPAMI.2022.3210780. Epub 2023 Apr 3.
5. Robust Commonsense Reasoning Against Noisy Labels Using Adaptive Correction.
   IEEE Trans Cybern. 2024 Jul;54(7):4138-4149. doi: 10.1109/TCYB.2023.3339629. Epub 2024 Jul 11.
6. CEG: A joint model for causal commonsense events enhanced story ending generation.
   PLoS One. 2023 May 23;18(5):e0286049. doi: 10.1371/journal.pone.0286049. eCollection 2023.
7. CommonsenseVIS: Visualizing and Understanding Commonsense Reasoning Capabilities of Natural Language Models.
   IEEE Trans Vis Comput Graph. 2023 Oct 26;PP. doi: 10.1109/TVCG.2023.3327153.
8. Diagnostic accuracy of large language models in psychiatry.
   Asian J Psychiatr. 2024 Oct;100:104168. doi: 10.1016/j.ajp.2024.104168. Epub 2024 Jul 25.
9. Web 2.0-based crowdsourcing for high-quality gold standard development in clinical natural language processing.
   J Med Internet Res. 2013 Apr 2;15(4):e73. doi: 10.2196/jmir.2426.
10. Leveraging Symbolic Knowledge Bases for Commonsense Natural Language Inference Using Pattern Theory.
    IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):13185-13202. doi: 10.1109/TPAMI.2023.3287837. Epub 2023 Oct 3.

References Cited in This Article

1. A noise audit of human-labeled benchmarks for machine commonsense reasoning.
   Sci Rep. 2024 Apr 14;14(1):8609. doi: 10.1038/s41598-024-58937-4.