Department of Computing Sciences, Bocconi University, Milano, Italy; Artificial Intelligence Center, Humanitas Clinical and Research Center - IRCCS, Via A. Manzoni 56, Rozzano 20089, Milan, Italy.
Department of Biomedical Sciences, Humanitas University, via Rita Levi Montalcini 4, 20072 Pieve Emanuele, Milan, Italy.
Int J Med Inform. 2024 Dec;192:105626. doi: 10.1016/j.ijmedinf.2024.105626. Epub 2024 Sep 19.
Data collection often relies on time-consuming manual input, even though a vast amount of information is embedded in unstructured text such as patients' medical records and clinical notes. Our study aims to develop a pipeline that combines active learning (AL) and natural language processing (NLP) techniques to enhance data extraction in an acute ischemic stroke cohort.
Consecutive acute ischemic stroke patients who received reperfusion therapies at IRCCS Humanitas Research Hospital were included. An Italian-language Bidirectional Encoder Representations from Transformers (BERT) model was trained with AL to automatically extract clinical variables from electronic health texts. Simulated AL performance was evaluated on a set of labels representing patients' comorbidities, comparing Bayesian Active Learning by Disagreement (BALD) with random text selection. Prognostic models predicting patients' functional outcomes were trained with Gradient Boosting on manually labelled and on semi-automatically extracted data, and their performance was compared.
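The BALD selection step mentioned above can be illustrated with a minimal sketch. It assumes each comorbidity is a binary label and that the model's uncertainty is approximated by several stochastic forward passes (for example with dropout left on), which the abstract does not specify. The function names `bald_score`, `select_bald`, and `select_random` are hypothetical and are not taken from the study's code. The score used here is the common mutual-information form of BALD: the entropy of the mean prediction minus the mean entropy of the individual predictions.

```python
import math
import random


def binary_entropy(p):
    """Entropy in nats of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(mc_probs):
    """BALD score for one text.

    mc_probs holds the predicted probability of the label from each
    stochastic forward pass (an assumed uncertainty approximation).
    """
    mean_p = sum(mc_probs) / len(mc_probs)
    predictive_entropy = binary_entropy(mean_p)
    expected_entropy = sum(binary_entropy(p) for p in mc_probs) / len(mc_probs)
    return predictive_entropy - expected_entropy


def select_bald(pool_probs, k):
    """Return indices of the k unlabelled texts with the highest BALD score."""
    scores = [bald_score(probs) for probs in pool_probs]
    ranked = sorted(range(len(pool_probs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def select_random(pool_size, k, seed=0):
    """Baseline: return k distinct text indices chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), k)


# Toy pool of three unlabelled texts, four forward passes each (made-up numbers).
pool = [
    [0.5, 0.5, 0.5, 0.5],      # passes agree on an uncertain prediction
    [0.05, 0.95, 0.1, 0.9],    # passes disagree strongly
    [0.99, 0.98, 0.99, 0.97],  # passes agree on a confident prediction
]
chosen = select_bald(pool, k=1)
```

In this toy pool the second text is chosen, because BALD rewards disagreement between passes rather than a prediction that is merely close to 0.5. In a simulated AL run, the chosen texts would be moved to the labelled set, the model retrained, and the resulting learning curve compared with the one obtained from `select_random`.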
During active learning, model performance remained null until around 20% of texts had been labelled, possibly because the root layers of the BERT model were frozen. Overall, however, active learning improved learning efficiency across most comorbidities. Prognostic models trained on manually labelled data and on semi-automatically extracted data showed no significant difference in performance, indicating effective prediction in both settings.
We developed an efficient language model to automate the extraction of clinical data from Italian unstructured health texts in a cohort of ischemic stroke patients. In a preliminary analysis, we demonstrated its potential applicability for enhancing prediction model accuracy.