Tomova Mihaela Todorova, Hofmann Martin, Mäder Patrick
Technische Universität Ilmenau, Ilmenau 98693, Germany.
Faculty of Biological Sciences, Friedrich Schiller University, Jena 07745, Germany.
Data Brief. 2022 Apr 27;42:108211. doi: 10.1016/j.dib.2022.108211. eCollection 2022 Jun.
Stakeholders of software development projects have various information needs for making rational decisions during their daily work. Satisfying these needs requires substantial knowledge of where and how the relevant information is stored, and it consumes valuable time that is often not available. Reducing the need for this knowledge is an ideal benchmark problem for text-to-SQL, a field in which public datasets are scarce and much needed. We propose the SEOSS-Queries dataset, consisting of natural language utterances and accompanying SQL queries drawn from previous studies, software projects, issue tracking tools, and expert surveys to cover a large variety of information need perspectives. Our dataset consists of 1,162 English utterances translating into 166 SQL queries; each query has four precise utterances and three more general ones. Furthermore, the dataset contains 393,086 labeled utterances extracted from issue tracker comments. We provide pre-trained SQLNet and RatSQL baseline models for benchmark comparisons and a replication package facilitating seamless application, and we discuss various other tasks that may be solved and evaluated using the dataset. The whole dataset with paraphrased natural language utterances and SQL queries is hosted at figshare.com/s/75ed49ef01ac2f83b3e2.
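The stated composition (166 queries, each paired with four precise and three more general utterances) can be sketched as a simple record type. This is only an illustration of the counts given above: the class name `QueryEntry`, its field names, and the sample SQL and utterances are hypothetical and are not taken from the dataset or its actual file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryEntry:
    """Hypothetical record: one SQL query with its paraphrased utterances."""
    sql: str
    precise: List[str]   # four precise utterances per query (per the abstract)
    general: List[str]   # three more general utterances per query

    def utterances(self) -> List[str]:
        return self.precise + self.general


# Counts stated in the abstract.
N_QUERIES = 166
PRECISE_PER_QUERY = 4
GENERAL_PER_QUERY = 3
total_utterances = N_QUERIES * (PRECISE_PER_QUERY + GENERAL_PER_QUERY)

# Invented example entry; table and column names are assumptions.
example = QueryEntry(
    sql="SELECT COUNT(*) FROM issue WHERE status = 'Open'",
    precise=[
        "Count the issues whose status is Open.",
        "How many issues have the status Open?",
        "Return the number of issues with status Open.",
        "Give me the count of issues where the status equals Open.",
    ],
    general=[
        "How many open issues are there?",
        "How much work is still unresolved?",
        "What is the size of the open backlog?",
    ],
)

print(total_utterances, len(example.utterances()))
```

The product 166 × (4 + 3) matches the 1,162 utterances reported above.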