Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA.
Vanderbilt University Medical Center, Vanderbilt University, Nashville, TN.
J Am Heart Assoc. 2023 Jul 4;12(13):e030046. doi: 10.1161/JAHA.123.030046. Epub 2023 Jun 22.
Background: The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing-based machine learning models to automatically detect Fontan cases from free text in electronic health records, and to compare their performance with ICD code-based classification.

Methods and Results: We included free-text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non-Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, namely support vector machines and 2 versions of RoBERTa (a robustly optimized transformer-based model for language understanding), to automatically identify Fontan cases from the notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its input length limit. We evaluated the machine learning models and ICD code-based classification on the held-out 20% of the patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79-0.83), 0.95 (95% CI, 0.92-0.97), and 0.89 (95% CI, 0.88-0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<0.05), and both natural language processing models outperformed ICD code-based classification (P<0.05). The sliding window strategy improved performance over the base model (P<0.05) but did not outperform support vector machines. ICD code-based classification produced more false positives.

Conclusions: Natural language processing models can automatically detect Fontan patients from clinical notes with higher accuracy than ICD codes, and the former showed potential for further improvement.
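The abstract does not describe how the sliding window strategy works, so the following is only a minimal sketch of one common way to apply a length-limited classifier to long notes. It assumes the note is already tokenized, that overlapping windows are scored independently, and that the patient-level score is the maximum window score. The names `sliding_windows`, `classify_document`, and `score_chunk`, the default window and stride, and the max-aggregation rule are all illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple


def sliding_windows(tokens: Sequence[int], window: int = 510, stride: int = 255) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most `window` tokens.

    The defaults are assumptions: 510 leaves room for two special tokens
    in a 512-token model input, and the stride gives roughly 50% overlap.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the current window already reaches the end of the note
        start += stride
    return chunks


def classify_document(
    tokens: Sequence[int],
    score_chunk: Callable[[List[int]], float],
    window: int = 510,
    stride: int = 255,
    threshold: float = 0.5,
) -> Tuple[bool, float]:
    """Score every window and aggregate with max (an assumed rule).

    `score_chunk` is a hypothetical stand-in for a fine-tuned classifier
    that returns the probability of the positive (Fontan) class.
    """
    scores = [score_chunk(chunk) for chunk in sliding_windows(tokens, window, stride)]
    doc_score = max(scores)
    return doc_score >= threshold, doc_score


if __name__ == "__main__":
    # Toy scorer: a window is "positive" if it contains the token id 42.
    toy_scorer = lambda chunk: 1.0 if 42 in chunk else 0.0
    print(classify_document(list(range(100)), toy_scorer, window=10, stride=5))
```

Max-aggregation suits cases where a single mention anywhere in a long note is enough to flag the patient. Averaging or voting across windows are alternative choices that trade sensitivity for fewer false positives.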