Institute of Medical Information and Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing 100020, China; Key Laboratory of Medical Information Intelligent Technology, Chinese Academy of Medical Sciences, Beijing 100020, China.
Institute of Medical Information and Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing 100020, China; School of Health Care Technology, Dalian Neusoft University of Information, Dalian 116023, China.
Artif Intell Med. 2023 Jun;140:102552. doi: 10.1016/j.artmed.2023.102552. Epub 2023 Apr 23.
Stroke is one of the leading causes of death and disability worldwide. National Institutes of Health Stroke Scale (NIHSS) scores recorded in electronic health records (EHRs) quantitatively describe patients' neurological deficits in evidence-based treatment and are crucial in stroke-related clinical investigations. However, their free-text format and lack of standardization inhibit their effective use. Automatically extracting the scale scores from clinical free text, so that their potential value in real-world studies can be realized, has become an important goal.
This study aims to develop an automated method to extract scale scores from the free text of EHRs.
We propose a two-step pipeline method to identify NIHSS items and numerical scores and validate its feasibility using a freely accessible critical care database, MIMIC-III (Medical Information Mart for Intensive Care III). First, we use MIMIC-III to create an annotated corpus. Then, we investigate candidate machine learning methods for two subtasks: NIHSS item and score recognition, and item-score relation extraction. We conduct both task-specific and end-to-end evaluations and compare our method with a rule-based method, using precision, recall and F1-score as evaluation metrics.
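The evaluation metrics named above can be illustrated with a minimal sketch. It assumes exact-match scoring over sets of (item, score, relation) triples; the abstract does not state the matching criterion, and the function name `precision_recall_f1` and the toy triples are hypothetical, not taken from the annotated corpus.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over two collections of triples.

    Assumption: a prediction counts as correct only if the whole triple
    matches a gold triple exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data for illustration only (not from MIMIC-III).
gold = {
    ("1a level of consciousness", "0", "has value"),
    ("1b level of consciousness questions", "1", "has value"),
    ("2 best gaze", "0", "has value"),
}
predicted = {
    ("1b level of consciousness questions", "1", "has value"),
    ("2 best gaze", "0", "has value"),
    ("2 best gaze", "1", "has value"),  # spurious relation
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Two of the three predicted triples are correct and two of the three gold triples are recovered, so precision, recall and F1 all equal 2/3 in this toy case.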
We use all available discharge summaries of stroke cases in MIMIC-III. The annotated NIHSS corpus contains 312 cases, 2929 scale items, 2774 scores and 2733 relations. Our method achieved a best F1-score of 0.9006, attained by combining BERT-BiLSTM-CRF and Random Forest, and outperformed the rule-based method (F1-score = 0.8098). In the end-to-end task, our method successfully recognized the item "1b level of consciousness questions", the score "1" and their relation "('1b level of consciousness questions', '1', 'has value')" from the sentence "1b level of consciousness questions: said name = 1", while the rule-based method could not.
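The data flow behind that example can be sketched as follows. This is an illustration of the two-step structure only, not the authors' implementation: the entities are written by hand where a trained tagger such as BERT-BiLSTM-CRF would produce them, and `toy_relation_classifier` is a hypothetical distance heuristic standing in for the trained Random Forest. The `Entity` class, the function names and the 40-character window are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str  # "ITEM" or "SCORE"
    text: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def candidate_pairs(entities):
    """Step 2 input: pair every recognized item with every recognized score."""
    items = [e for e in entities if e.label == "ITEM"]
    scores = [e for e in entities if e.label == "SCORE"]
    return [(item, score) for item in items for score in scores]


def toy_relation_classifier(item, score, max_gap=40):
    """Placeholder for a trained classifier (assumption, not the paper's model).

    Accepts a pair when the score starts within max_gap characters
    after the end of the item.
    """
    return 0 <= score.start - item.end <= max_gap


def extract_relations(entities):
    """Keep the candidate pairs that the classifier accepts."""
    return [
        (item.text, score.text, "has value")
        for item, score in candidate_pairs(entities)
        if toy_relation_classifier(item, score)
    ]


sentence = "1b level of consciousness questions: said name = 1"

# Step 1 output, hand-written here in place of a sequence tagger.
item_text = "1b level of consciousness questions"
item = Entity("ITEM", item_text, 0, len(item_text))
score_start = sentence.rindex("1")
score = Entity("SCORE", "1", score_start, score_start + 1)

relations = extract_relations([item, score])
print(relations)
```

Separating entity recognition from relation classification lets each step be evaluated on its own, which matches the task-specific and end-to-end evaluations described in the abstract.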
The two-step pipeline method we propose is an effective approach to identify NIHSS items, scores and their relations. With its help, clinical investigators can easily retrieve and access structured scale data, thereby supporting stroke-related real-world studies.