Department of Computer Science, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa 31982, Saudi Arabia.
Department of Computer Science, Durham University, Upper Mountjoy Campus, Stockton Road, Durham DH1 3LE, UK.
Sensors (Basel). 2022 Dec 8;22(24):9609. doi: 10.3390/s22249609.
This study develops and evaluates an automated system for extracting information on patient substance use (smoking, alcohol, and drugs) from unstructured clinical text (medical discharge records). The authors propose a four-stage system that extracts the substance-use status and related attributes (type, frequency, amount, quit-time, and period). The first stage uses keyword search to detect sentences related to substance use and to exclude unrelated records. The second stage develops and applies an extension of the NegEx negation detection algorithm to detect negated records. The third stage identifies the temporal status of the substance use by applying windowing and chunking methodologies. Finally, the fourth stage uses regular expressions, syntactic patterns, and keyword search to extract the substance-use attributes. The proposed system achieves an F1-score of up to 0.99 for identifying substance-use-related records, 0.98 for detecting the negation status, and 0.94 for identifying temporal status. Moreover, F1-scores of up to 0.98, 0.98, 1.00, 0.92, and 0.98 are achieved for the extraction of the amount, frequency, type, quit-time, and period attributes, respectively. Natural Language Processing (NLP) and rule-based techniques prove effective for extracting substance-use status and attributes, and the proposed system can detect both over sentence-level and document-level data. Results show that the proposed system outperforms the state-of-the-art substance-use identification system it was compared against on an unseen dataset, demonstrating its generalisability.
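The four stages described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the keyword stems, negation triggers, past-tense cues, window size, and regular expressions are hypothetical stand-ins, not the authors' lexicons or rules. The negation check is only a NegEx-style look-back over preceding tokens, not the extended algorithm the paper develops, and the temporal step uses a plain token window without chunking.

```python
import re

# Hypothetical lexicons, for illustration only (not the authors' lists).
SUBSTANCE_KEYWORDS = {
    "smoking": ("smok", "tobacco", "cigarette"),
    "alcohol": ("alcohol", "etoh", "drink"),
    "drugs": ("drug", "cocaine", "heroin", "marijuana"),
}
NEGATION_TRIGGERS = ("denies", "denied", "no", "never", "without")
PAST_CUES = ("quit", "former", "stopped", "ex")
WINDOW = 5  # assumed number of tokens inspected around a keyword

# Hypothetical attribute patterns; each exposes its value as group 1.
UNIT = r"(?:day|week|month|year)"
AMOUNT_RE = re.compile(
    r"\b(\d+(?:\.\d+)?\s*(?:packs?|cigarettes?|drinks?|beers?|glasses))\b", re.I)
FREQUENCY_RE = re.compile(
    rf"\b((?:per|a|each|every)\s+{UNIT}|daily|weekly|monthly)\b", re.I)
QUIT_RE = re.compile(rf"\bquit\b.*?(\d+\s+{UNIT}s?)\s+ago", re.I)
PERIOD_RE = re.compile(rf"\bfor\s+(\d+\s+{UNIT}s?)\b", re.I)


def _first_keyword_index(tokens, stems):
    """Index of the first token starting with any stem, or None."""
    for i, tok in enumerate(tokens):
        if tok.startswith(stems):  # str.startswith accepts a tuple
            return i
    return None


def _first_match(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


def extract(sentence):
    """Return one record per substance type mentioned in the sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    records = []
    for substance, stems in SUBSTANCE_KEYWORDS.items():
        # Stage 1: keyword search; skip substances that are not mentioned.
        idx = _first_keyword_index(tokens, stems)
        if idx is None:
            continue
        # Stage 2: NegEx-style check for a trigger shortly before the keyword.
        preceding = tokens[max(0, idx - WINDOW):idx]
        negated = any(tok in NEGATION_TRIGGERS for tok in preceding)
        # Stage 3: temporal status from cues in a window around the keyword.
        window = tokens[max(0, idx - WINDOW):idx + WINDOW + 1]
        if negated:
            status = "none"
        elif any(tok in PAST_CUES for tok in window):
            status = "past"
        else:
            status = "current"
        # Stage 4: regular expressions for the remaining attributes.
        # Simplification: attributes are shared by all types in the sentence.
        records.append({
            "type": substance,
            "status": status,
            "amount": _first_match(AMOUNT_RE, sentence),
            "frequency": _first_match(FREQUENCY_RE, sentence),
            "quit_time": _first_match(QUIT_RE, sentence),
            "period": _first_match(PERIOD_RE, sentence),
        })
    return records
```

Calling `extract("Patient denies alcohol or drug use.")` yields two records, both with status `"none"`, while a sentence with no substance keyword yields an empty list. A rule-based design like this keeps every decision traceable to a specific keyword or pattern, which suits clinical text, although real discharge records would need far larger lexicons than the toy ones assumed here.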