UNICAEN, ENSICAEN, CNRS - UMR GREYC, 14000, Caen, France.
INRAE, F-34398, Montpellier, France.
Sci Data. 2023 Nov 22;10(1):818. doi: 10.1038/s41597-023-02705-y.
Land artificialization is a serious societal problem. Urban planning and natural risk management aim to mitigate it. In France, these practices rely on Local Land Plans (PLU - Plan Local d'Urbanisme) and Natural Risk Prevention Plans (PPRn - Plan de Prévention des Risques naturels), which contain land use rules. To facilitate automatic extraction of these rules, we manually annotated a set of such documents covering Montpellier, a rapidly evolving agglomeration exposed to natural risks. We defined a format for labeled examples in which each entry includes a title and a subtitle. In addition, we proposed a hierarchical representation of class labels to broaden the use of our corpus. The corpus consists of 1934 textual segments, each labeled with one of 4 classes (Verifiable, Non-verifiable, Informative and Not pertinent), and is the first French-language corpus in the fields of urban planning and natural risk management. In addition to presenting the corpus, we tested a state-of-the-art text classification approach to demonstrate the corpus's usability for automatic rule extraction.