LERCause: Deep learning approaches for causal sentence identification from nuclear safety reports.

Affiliations

School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, Illinois, United States of America.

School of Information, Florida State University, Tallahassee, Florida, United States of America.

Publication information

PLoS One. 2024 Aug 22;19(8):e0308155. doi: 10.1371/journal.pone.0308155. eCollection 2024.

Abstract

Identifying causal sentences in nuclear incident reports is essential for advancing nuclear safety research and applications. Nonetheless, accurately locating and labeling causal sentences in text data is challenging and might benefit from automated techniques. In this paper, we introduce LERCause, a labeled dataset combined with labeling methods meant to serve as a foundation for the classification of causal sentences in the domain of nuclear safety. We applied three BERT models (BERT, BioBERT, and SciBERT) to 10,608 annotated sentences from the Licensee Event Report (LER) corpus to predict sentence labels (Causal vs. non-Causal). For comparison, we also used a keyword-based heuristic strategy, three standard machine learning methods (Logistic Regression, Gradient Boosting, and Support Vector Machine), and a deep learning approach (Convolutional Neural Network; CNN). We found that the BERT-centric models outperformed all other tested models on all evaluation metrics (accuracy, precision, recall, and F1 score). BioBERT achieved the highest overall F1 score, 94.49%, in ten-fold cross-validation. Our dataset and coding framework can provide a robust baseline for assessing and comparing new causal sentence extraction techniques. As far as we know, our research breaks new ground by leveraging BERT-centric models for causal sentence classification in the nuclear safety domain and by openly distributing labeled data and code to enable reproducibility in subsequent research.
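To make the comparison baseline and the reported metrics concrete, the following is a minimal sketch of a keyword-based heuristic classifier together with accuracy, precision, recall, and F1 for the Causal class. It is an illustration only, not the authors' implementation: the cue phrases in `CAUSAL_CUES`, the function names `predict_causal` and `binary_metrics`, and the example sentences are all assumptions, since the abstract does not list the keywords actually used.

```python
import re

# Hypothetical cue phrases; the abstract does not give the paper's keyword list.
CAUSAL_CUES = ("because", "due to", "caused by", "resulted in",
               "as a result of", "led to")

# One case-insensitive pattern that matches any cue as a whole phrase.
CUE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(cue) for cue in CAUSAL_CUES) + r")\b",
    re.IGNORECASE,
)


def predict_causal(sentence: str) -> bool:
    """Label a sentence Causal (True) if it contains any cue phrase."""
    return CUE_PATTERN.search(sentence) is not None


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1, treating Causal (True) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Invented example sentences, not taken from the LER corpus.
    sentences = [
        "The pump tripped due to a failed relay.",
        "The reactor was operating at 100 percent power.",
    ]
    gold = [True, False]
    predicted = [predict_causal(s) for s in sentences]
    print(binary_metrics(gold, predicted))
```

A heuristic of this kind needs no training data, which is what makes it a natural floor against which to compare trained classifiers; it cannot, however, recognize causality expressed without an explicit cue phrase.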

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d9f1/11340986/dcfe0687952e/pone.0308155.g001.jpg
