Nananukul Navapat, Shen Ke, Kejriwal Mayank
University of Southern California, 4676 Admiralty Way, Suite 1001 Marina del Rey, CA 90292, USA.
Data Brief. 2024 Aug 24;57:110869. doi: 10.1016/j.dib.2024.110869. eCollection 2024 Dec.
Commonsense reasoning has emerged as a challenging problem in Artificial Intelligence (AI). However, one area of commonsense reasoning that has not received nearly as much attention in the AI research community is plausibility estimation, which focuses on determining the likelihood of commonsense statements. Human-annotated benchmarks are essential for advancing research in this nascent area, as they enable researchers to develop and evaluate AI models effectively. Because plausibility is a subjective concept, it is important to obtain nuanced annotations, rather than a binary label of 'plausible' or 'implausible'. Furthermore, it is also important to obtain multiple human annotations for a given statement to ensure the validity of the labels. In this data article, we describe the process of re-annotating an existing commonsense plausibility dataset (SemEval-2020 Task 4) using large-scale crowdsourcing on the Amazon Mechanical Turk platform. We obtain 10,000 unique annotations on a corpus of 2000 sentences (five independent annotations per sentence). Based on these annotations, each sentence was assigned an aggregate plausibility label. Next, we prompted the GPT-3.5 and GPT-4 models developed by OpenAI: sentences from the human-annotated files were fed into the models using custom prompt templates, and the models' generated labels were compared with the human-assigned labels to assess alignment. The PMC-Dataset is meant to serve as a rich resource for analysing and comparing human and machine commonsense reasoning capabilities, specifically on plausibility. Researchers can utilise this dataset to train, fine-tune, and evaluate AI models on plausibility. Applications include determining the likelihood of everyday events, assessing the realism of hypothetical scenarios, and distinguishing between plausible and implausible statements in commonsense text. Ultimately, we intend for the dataset to support ongoing AI research by offering a robust foundation for developing models that are better aligned with human commonsense reasoning.
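The following is a minimal sketch of the kind of prompting-and-comparison pipeline the abstract describes: each sentence is sent to an OpenAI chat model with a plausibility question, and the model's answer is compared against the human-derived label. The prompt wording, the label set, and the helper names are illustrative assumptions, not the authors' actual templates or code.

```python
# Sketch only: prompt a GPT model for a plausibility label and measure
# agreement with human annotations. Prompt text and labels are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical label set; the article's aggregate labels may differ.
LABELS = ["plausible", "neutral", "implausible"]


def model_label(sentence: str, model: str = "gpt-4") -> str:
    """Ask the model to rate the plausibility of a single sentence."""
    prompt = (
        "Rate the plausibility of the following statement. "
        f"Answer with exactly one word from {LABELS}.\n\n"
        f"Statement: {sentence}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "unparsed"


def agreement(pairs) -> float:
    """Fraction of (human_label, model_label) pairs that match."""
    matches = sum(1 for human, machine in pairs if human == machine)
    return matches / len(pairs) if pairs else 0.0
```

In practice, the human label for each sentence would be aggregated from the five independent crowd annotations before being passed to a comparison function such as the hypothetical `agreement` above.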