Lima Weslley, Silva Victor, Silva Jasson, Lira Ricardo, Paiva Anselmo
Federal University of Piauí. Campus Universitário Ministro Petrônio Portella. Teresina, Piauí, Brazil.
Federal University of Maranhão. Av. dos Portugueses, 1966 São Luís, Maranhão, Brazil.
Data Brief. 2024 Dec 9;58:111202. doi: 10.1016/j.dib.2024.111202. eCollection 2025 Feb.
Digital transformation has significantly impacted public procurement, improving operational efficiency, transparency, and competition. This transformation has allowed the automation of data analysis and oversight in public administration. Public procurement involves various stages and generates a multitude of documents. However, experts still analyze these unstructured textual documents manually, a process that is time-consuming and inefficient. To address this issue, we introduce BidCorpus, a novel and comprehensive dataset consisting of thousands of documents related to public procurement, specifically bidding notices from Brazilian public websites. The dataset was labeled using weak supervision techniques, manual labeling, and BERT-based language models. Models trained with these annotated data showed promising results, with metrics greater than 80% in various experiments. The models were also robust to changes intentionally made to bidding notices in an attempt to evade fraud detection. All the resources from this work are publicly available, including the documents, pre-processing scripts, and training and evaluation of the models. We expect the dataset and its labels to be of great value to researchers working on public procurement problems.