Samlani Nouha, Pino Daphne Silva, Bertolo Reginaldo, Pak Tannaz
Teesside University, Middlesbrough, UK.
Brazilian Synchrotron Light Laboratory (LNLS), Campinas, Brazil.
Sci Data. 2024 Mar 2;11(1):263. doi: 10.1038/s41597-024-03068-8.
In the Brazilian state of São Paulo, contaminated sites (CSs) threaten the health, environment and socioeconomic conditions of local populations. Over the past two decades, the Environmental Agency of São Paulo (CETESB) has monitored the known CSs. This paper describes the dataset produced by digitising the CETESB reports and making them publicly accessible in English. The dataset reports on qualitative aspects of contamination within the registered sites (e.g., contamination type and spread) and on the management status of those sites. The data was extracted from the CETESB reports using a machine-learning computer vision algorithm. The algorithm comprises two components: an optical character recognition (OCR) engine for text extraction and a convolutional neural network (CNN) image classifier to identify checked boxes. Digitisation was followed by harmonisation and quality assurance processes to ensure the consistency and validity of the data. Making this dataset accessible will enable future work on predictive analysis and decision-making, and will inform the policy-making needed to improve the management of CSs in Brazil.
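The two-component extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The OCR engine and the checkbox classifier are passed in as plain callables, and every name here (`digitise_form`, `SiteRecord`, `mean_darkness`, the field names and the 0.5 threshold) is a hypothetical choice made for illustration. In particular, `mean_darkness` is a trivial stand-in for the CNN classifier, used only so the sketch runs without any external library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical representation: a region is a 2D grid of grayscale pixels,
# where 1.0 is white paper and 0.0 is black ink.
Region = Sequence[Sequence[float]]


@dataclass
class SiteRecord:
    """One digitised report: free-text fields plus checkbox states."""
    text_fields: Dict[str, str]
    checked: Dict[str, bool]


def digitise_form(
    text_regions: Dict[str, Region],
    checkbox_regions: Dict[str, Region],
    ocr: Callable[[Region], str],
    checkbox_prob: Callable[[Region], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> SiteRecord:
    """Run OCR on text regions and a classifier on checkbox regions."""
    # Component 1: text extraction, with surrounding whitespace trimmed.
    text_fields = {name: ocr(region).strip()
                   for name, region in text_regions.items()}
    # Component 2: a box counts as checked when the classifier's
    # probability reaches the threshold.
    checked = {name: checkbox_prob(region) >= threshold
               for name, region in checkbox_regions.items()}
    return SiteRecord(text_fields, checked)


def mean_darkness(region: Region) -> float:
    """Stand-in for a CNN classifier: the average ink coverage of a region."""
    pixels = [p for row in region for p in row]
    return 1.0 - sum(pixels) / len(pixels)
```

In practice, `ocr` would wrap a real OCR engine and `checkbox_prob` would wrap a trained CNN. Keeping both behind callables lets the harmonisation and quality assurance steps operate on `SiteRecord` objects regardless of which models produced them.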