Zong Zhihao, Shan Hongtao, Zhang Gaoyu, Yuan George Xianzhi, Zhang Shuyi
School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China.
School of Information Management, Shanghai Lixin University of Accounting and Finance, Shanghai, China.
PLoS One. 2025 Jul 8;20(7):e0327186. doi: 10.1371/journal.pone.0327186. eCollection 2025.
Plant attributes such as growing environment, growth cycle, and ecological distribution can support fields like agricultural production and biodiversity. This information is widely dispersed across texts, and extracting it manually is highly inefficient: it takes considerable time and increases the likelihood of overlooking relevant details. To convert textual data into structured information, we extract relational triples of the form (subject, relation, object), where the subject is a plant name, the object is a plant attribute, and the relation is the category of that attribute. To reduce complexity, we jointly extract entities and relations using a tagging scheme. The task is broken down into three parts. First, a matrix is used to match plant entities and plant attributes simultaneously. Second, the predefined categories of plant attributes are predicted. Finally, the predicted categories are matched with the entity-attribute pairs. Tagging-based methods typically rely on parameter sharing to let the different tasks interact, but this can also lead to error amplification and unstable parameter updates. We therefore adopt improved techniques at different stages to enhance model performance: an adjustment to the word embedding layer of BERT and an optimization of relation prediction. The modified word embedding layer is intended to integrate contextual information better during text representation and to reduce interference from erroneous information. The relation prediction part mainly involves multi-level fusion of textual information, which makes corrections and highlights important information. We name this three-stage method "Bwdgv". Compared with the currently advanced PRGC model, the proposed method improves the F1-score by 1.4%. With the extracted triples, we can construct knowledge graphs and support other tasks that make better use of plant attributes.
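The three-part decomposition and the final knowledge-graph step can be illustrated with a minimal Python sketch. This is not the Bwdgv model: the inputs are hand-written stand-ins for what the three stages might output, and every name here (`Triple`, `assemble_triples`, `build_graph`, the example sentence content and category labels) is a hypothetical choice made for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str   # plant name
    relation: str  # category of the plant attribute
    obj: str       # plant attribute


def assemble_triples(subjects, objects, pair_matrix,
                     predicted_relations, relation_pairs):
    """Combine the outputs of the three parts into triples.

    subjects, objects   : candidate plant names and attribute strings
    pair_matrix[i][j]   : 1 if subjects[i] is matched with objects[j] (part 1)
    predicted_relations : attribute categories judged present (part 2)
    relation_pairs      : category -> set of (i, j) pairs tagged for it (part 3)
    """
    triples = []
    for relation in sorted(predicted_relations):
        for i, j in sorted(relation_pairs.get(relation, ())):
            # Keep a pair only if the matching matrix also links it.
            if pair_matrix[i][j]:
                triples.append(Triple(subjects[i], relation, objects[j]))
    return triples


def build_graph(triples):
    """Group triples into a nested dict: plant -> category -> attributes."""
    graph = defaultdict(lambda: defaultdict(list))
    for t in triples:
        graph[t.subject][t.relation].append(t.obj)
    return {plant: dict(rels) for plant, rels in graph.items()}


# Hypothetical stage outputs for one toy sentence (assumed, not from the paper).
subjects = ["Ginkgo biloba"]
objects = ["well-drained soil", "spring"]
pair_matrix = [[1, 1]]
predicted_relations = {"growing environment", "growth cycle"}
relation_pairs = {
    "growing environment": {(0, 0)},
    "growth cycle": {(0, 1)},
    # Tagged but not predicted in part 2, so it is filtered out.
    "ecological distribution": {(0, 1)},
}

triples = assemble_triples(subjects, objects, pair_matrix,
                           predicted_relations, relation_pairs)
graph = build_graph(triples)
```

In this toy setup, a candidate pair becomes a triple only when all three parts agree on it, which is one simple way to picture how the matching matrix and the category prediction constrain each other.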