Gabud Roselyn, Lapitan Portia, Mariano Vladimir, Mendoza Eduardo, Pampolina Nelson, Clariño Maria Art Antonette, Batista-Navarro Riza
Department of Computer Science, College of Engineering, University of the Philippines Diliman, Quezon City, Philippines.
Institute of Computer Science, College of Arts and Sciences, University of the Philippines Los Baños, Laguna, Philippines.
Front Artif Intell. 2024 May 23;7:1371411. doi: 10.3389/frai.2024.1371411. eCollection 2024.
Fine-grained, descriptive information on the habitats and reproductive conditions of plant species is crucial in forest restoration and rehabilitation efforts. Precise timing of fruit collection and knowledge of species' habitat preferences and reproductive status are necessary, especially for tropical plant species that have short-lived recalcitrant seeds and for those that exhibit complex reproductive patterns, e.g., species with supra-annual mass flowering events that may occur at irregular intervals. Understanding plant regeneration when planning for effective reforestation can be aided by providing access to structured information, e.g., in knowledge bases, that spans years if not decades and covers a wide range of geographic locations. The content of such a resource can be enriched with literature-derived information on species' time-sensitive reproductive conditions and location-specific habitats.
We sought to develop unsupervised approaches to extract relationships pertaining to habitats and their locations, and to reproductive conditions of plant species and their corresponding temporal information. First, we handcrafted rules for a traditional rule-based pattern matching approach. We then developed a relation extraction approach building upon transformer models, specifically the Text-to-Text Transfer Transformer (T5), casting the relation extraction problem as a question answering and natural language inference task. Finally, we proposed a novel unsupervised hybrid approach that combines our rule-based and transformer-based approaches.
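As a rough illustration of how a rule-based extractor and a question answering / natural language inference formulation might be combined, consider the minimal sketch below. Everything in it is an assumption made for illustration: the regular expression, the question and hypothesis templates, the names `rule_based`, `as_question`, `as_hypothesis` and `hybrid`, and the strategy of taking the union of both outputs are hypothetical and are not taken from the paper. The two stub functions stand in for a T5-style model, which is not called here.

```python
import re
from typing import Callable, Iterable, Optional, Set, Tuple

# (reproductive condition, temporal expression)
Relation = Tuple[str, str]

# Hypothetical, deliberately tiny vocabularies for the handcrafted rule.
CONDITION = r"(?P<cond>flowering|fruiting|flowers|fruits)"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
RULE = re.compile(
    rf"\b{CONDITION}\b[^.]*?\b(?:in|during|from)\s+"
    rf"(?P<time>{MONTH}(?:\s+(?:to|and)\s+{MONTH})?)",
    re.IGNORECASE,
)


def rule_based(text: str) -> Set[Relation]:
    """Extract relations with the handcrafted pattern only."""
    return {(m.group("cond").lower(), m.group("time"))
            for m in RULE.finditer(text)}


def as_question(cond: str) -> str:
    """Cast a candidate relation as a question (assumed template)."""
    return f"When does {cond} occur?"


def as_hypothesis(cond: str, time: str) -> str:
    """Cast a candidate relation as an NLI hypothesis (assumed template)."""
    return f"The {cond} occurs in {time}."


def hybrid(text: str,
           conditions: Iterable[str],
           answer_fn: Callable[[str, str], Optional[str]],
           entail_fn: Callable[[str, str], bool]) -> Set[Relation]:
    """Union of rule-based relations and model answers that pass an
    entailment check. The union strategy is an assumption of this sketch."""
    relations = rule_based(text)
    for cond in conditions:
        answer = answer_fn(as_question(cond), text)
        if answer and entail_fn(text, as_hypothesis(cond, answer)):
            relations.add((cond.lower(), answer))
    return relations


# Stand-ins for a question answering model and an NLI model.
def stub_answer(question: str, context: str) -> Optional[str]:
    return "June" if "fruiting" in question else None


def stub_entail(premise: str, hypothesis: str) -> bool:
    return True


if __name__ == "__main__":
    text = ("Flowering was observed from March to May. "
            "Trees were fruiting by June.")
    print(sorted(rule_based(text)))
    print(sorted(hybrid(text, ["flowering", "fruiting"],
                        stub_answer, stub_entail)))
```

In this toy setting the rule misses the second sentence because it contains none of the trigger prepositions, while the stubbed model supplies that relation. This is one way in which combining the two could raise recall.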
Evaluation of our hybrid approach on an annotated corpus of biodiversity-focused documents demonstrated an improvement of up to 15 percentage points in recall and the best performance over solely rule-based and transformer-based methods, with F1-scores ranging from 89.61% to 96.75% for reproductive condition - temporal expression relations and from 85.39% to 89.90% for habitat - geographic location relations. Our work shows that, even without training models on any domain-specific labeled dataset, we are able to extract relationships between biodiversity concepts from literature with satisfactory performance.
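For readers less familiar with the metrics reported above, the following sketch shows how precision, recall and F1-score can be computed for extracted relations against gold annotations. It assumes exact matching of relation pairs, which may differ from the matching criteria used in the actual evaluation. The function name `prf` and the toy relations are hypothetical and are not results from the paper.

```python
from typing import Set, Tuple

Relation = Tuple[str, str]


def prf(predicted: Set[Relation], gold: Set[Relation]) -> Tuple[float, float, float]:
    """Precision, recall and F1 under exact matching of relation pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data for illustration only.
gold = {("flowering", "March"), ("fruiting", "June"),
        ("flowering", "October"), ("fruiting", "December")}
rule_only = {("flowering", "March"), ("fruiting", "June")}
hybrid_out = rule_only | {("flowering", "October"), ("fruiting", "May")}

if __name__ == "__main__":
    for name, predicted in [("rule-based", rule_only), ("hybrid", hybrid_out)]:
        p, r, f = prf(predicted, gold)
        print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy case the hybrid output gains recall at some cost in precision, and its F1-score is higher overall.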