
The plausibility machine commonsense (PMC) dataset: A massively crowdsourced human-annotated dataset for studying plausibility in large language models.

Author information

Nananukul Navapat, Shen Ke, Kejriwal Mayank

Affiliation

University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292, USA.

Publication information

Data Brief. 2024 Aug 24;57:110869. doi: 10.1016/j.dib.2024.110869. eCollection 2024 Dec.

Abstract

Commonsense reasoning has emerged as a challenging problem in Artificial Intelligence (AI). However, one area of commonsense reasoning that has not received nearly as much attention in the AI research community is plausibility, which focuses on determining the likelihood of commonsense statements. Human-annotated benchmarks are essential for advancing research in this nascent area, as they enable researchers to develop and evaluate AI models effectively. Because plausibility is a subjective concept, it is important to obtain nuanced annotations, rather than a binary label of 'plausible' or 'implausible'. Furthermore, it is also important to obtain multiple human annotations for a given statement, to ensure the validity of the labels. In this data article, we describe the process of re-annotating an existing commonsense plausibility dataset (SemEval-2020 Task 4) using large-scale crowdsourcing on the Amazon Mechanical Turk platform. We obtained 10,000 unique annotations on a corpus of 2000 sentences (five independent annotations per sentence). Based on these labels, each sentence was assigned an aggregate plausibility label. Next, we prompted the GPT-3.5 and GPT-4 models developed by OpenAI. Sentences from the human-annotated files were fed into the models using custom prompt templates, and the models' generated labels were used to determine whether they aligned with those output by humans. The PMC-Dataset is meant to serve as a rich resource for analysing and comparing human and machine commonsense reasoning capabilities, specifically on plausibility. Researchers can utilise this dataset to train, fine-tune, and evaluate AI models on plausibility. Applications include: determining the likelihood of everyday events, assessing the realism of hypothetical scenarios, and distinguishing between plausible and implausible statements in commonsense text. Ultimately, we intend for the dataset to support ongoing AI research by offering a robust foundation for developing models that are better aligned with human commonsense reasoning.
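The annotation-aggregation and human–model comparison workflow described in the abstract can be sketched as follows. This is a minimal illustration only: the function names, label strings, and example data are hypothetical and do not reflect the dataset's actual schema or label set.

```python
from collections import Counter

def aggregate_label(ratings):
    """Collapse five independent per-sentence plausibility ratings
    into one aggregate label via majority vote (ties resolved by
    whichever label Counter encounters first)."""
    return Counter(ratings).most_common(1)[0][0]

def agreement_rate(human_labels, model_labels):
    """Fraction of sentences where the model-generated label
    matches the aggregate human label."""
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)

# Hypothetical example: three sentences, five annotations each.
annotations = [
    ["plausible"] * 4 + ["implausible"],
    ["implausible"] * 3 + ["plausible"] * 2,
    ["plausible"] * 5,
]
human = [aggregate_label(r) for r in annotations]
# Labels as they might be parsed from a model's prompted responses.
model = ["plausible", "plausible", "plausible"]
print(human)                         # ['plausible', 'implausible', 'plausible']
print(agreement_rate(human, model))  # 2 of 3 sentences agree
```

In the actual study the model labels come from prompting GPT-3.5 and GPT-4 with custom templates; the sketch above only shows the comparison step once both label sets are in hand.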


Similar articles

CRIC: A VQA Dataset for Compositional Reasoning on Vision and Commonsense.
IEEE Trans Pattern Anal Mach Intell. 2023 May;45(5):5561-5578. doi: 10.1109/TPAMI.2022.3210780. Epub 2023 Apr 3.

Diagnostic accuracy of large language models in psychiatry.
Asian J Psychiatr. 2024 Oct;100:104168. doi: 10.1016/j.ajp.2024.104168. Epub 2024 Jul 25.

Leveraging Symbolic Knowledge Bases for Commonsense Natural Language Inference Using Pattern Theory.
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):13185-13202. doi: 10.1109/TPAMI.2023.3287837. Epub 2023 Oct 3.
