Multimodal machine learning for predicting perioperative safety indicators in spinal surgery.

Author information

Mani Kyle, Scharfenberger Thomas, Goldman Samuel N, Kleinbart Emily, Mostafa Evan, Ramos Rafael De La Garza, Fourman Mitchell S, Eleswarapu Ananth

Affiliations

Albert Einstein College of Medicine, Bronx, NY, USA.

Department of Orthopaedic Surgery, Montefiore Medical Center, Bronx, NY, USA.

Publication information

Spine J. 2025 Mar 29. doi: 10.1016/j.spinee.2025.03.021.

BACKGROUND CONTEXT

Machine learning (ML) algorithms can utilize the large amount of tabular data in electronic health records (EHRs) to predict perioperative safety indicators. Integrating unstructured free-text inputs via natural language processing (NLP) may further enhance predictive accuracy.

PURPOSE

To design and validate a preoperative multimodal ML architecture that integrates structured EHR data (patient demographics, comorbidities, and clinical covariates) with unstructured free-text inputs (past medical and surgical history, medications, and problem lists) via NLP. The multimodal models aim to improve the prediction of perioperative safety indicators compared to baseline ML models that only use structured tabular EHR data.

STUDY DESIGN

Retrospective cohort study.

PATIENT SAMPLE

1,898 patients admitted for elective or emergency spine surgery at four separate large urban academic spine centers during a 5-year period from 2018 to 2023.

OUTCOME MEASURES

Numerical outputs between 0 and 1 corresponding to the likelihood of (I) extended length of stay (LOS), (II) 90-day reoperation, and (III) perioperative intensive care unit (ICU) admission.

METHODS

We predicted the following safety indicators: (I) extended length of stay (LOS), (II) 90-day reoperation, and (III) perioperative intensive care unit (ICU) admission. The quanteda package for NLP within the R environment was used to preprocess free-text EHR inputs. The refined text was tokenized and transformed into numerical vectors using a bag-of-words approach, then integrated with the tabular EHR data to create a document-feature matrix. Two extreme gradient boosting (XGBoost) ML models were trained: a base model using only structured tabular EHR data, and a combined multimodal model using both structured tabular EHR data and numerical vectors derived from free-text NLP inputs. Hyperparameter tuning was performed via grid search, and the models were validated using 10-fold cross-validation with an 80:20 training/testing split. Word clouds were generated for the free-text data, and explainable artificial intelligence (XAI) techniques were employed to assess feature importance. Model performance metrics included area under the receiver operating characteristic curve (AUC-ROC), Brier score, calibration slope, calibration intercept, precision, recall, and F1-score.
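
As a rough illustration of the bag-of-words step described above, the following minimal sketch tokenizes free-text entries, counts terms, and appends the counts to tabular features. It is written in plain Python as a stand-in and is not the authors' R/quanteda pipeline. The records, field names (`age`, `bmi`, `text`), stopword list, and `text_` column prefix are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy records; field names and values are invented for illustration.
records = [
    {"age": 64, "bmi": 31.2, "text": "Vertebral osteomyelitis, hypertension"},
    {"age": 52, "bmi": 27.5, "text": "Cervical myelopathy; prior lumbar fusion"},
    {"age": 71, "bmi": 29.8, "text": "Lumbar radiculopathy, hypertension"},
]

# Assumed stopword list; a real preprocessing step would use a fuller one.
STOPWORDS = {"and", "of", "the", "with", "prior"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_vocabulary(texts):
    """Return every distinct token in sorted order so columns are deterministic."""
    return sorted({tok for text in texts for tok in tokenize(text)})


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def combined_matrix(records, tabular_fields):
    """Concatenate tabular features with term counts, one row per patient."""
    vocabulary = build_vocabulary(r["text"] for r in records)
    columns = list(tabular_fields) + [f"text_{term}" for term in vocabulary]
    rows = [
        [r[f] for f in tabular_fields] + bag_of_words(r["text"], vocabulary)
        for r in records
    ]
    return columns, rows


columns, rows = combined_matrix(records, ["age", "bmi"])
for name, value in zip(columns, rows[0]):
    print(f"{name}: {value}")
```

Each row of `rows` could then be passed to a gradient boosted classifier in place of the tabular-only feature vector. Dropping the `text_` columns gives the base model's input.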

RESULTS

A total of 1,898 patients (60.7% female) were extracted from January 2018 to September 2023, with a median age of 60.0 years (IQR: 52.0-68.0) and a median body mass index (BMI) of 30.3 kg/m² (IQR: 26.3-34.6). Extended LOS was defined as ≥14.4 days, which applied to 10.1% of patients. The median LOS for the entire cohort was 4.0 days (IQR: 2.0-7.0), the 90-day reoperation rate was 10.54%, and the ICU admission rate was 7.74%. The preoperative tabular EHR models predicted perioperative safety indicators with AUC ranging from 0.770 to 0.779, Brier scores ranging from 0.074 to 0.099, and calibration slopes ranging from 2.279 to 2.418. Precision and recall for these models ranged from 0.918 to 0.973 and from 0.988 to 0.994, respectively, resulting in F1-scores between 0.954 and 0.973. The combined multimodal models predicted perioperative safety indicators with AUC ranging from 0.827 to 0.903, Brier scores ranging from 0.056 to 0.083, and calibration slopes ranging from 0.755 to 1.217. The multimodal models achieved precision ranging from 0.909 to 0.933 and recall ranging from 0.979 to 0.994, leading to F1-scores between 0.943 and 0.962. Important tabular predictors included patient age, BMI, hemoglobin level, white blood cell count, platelet count, and a combined anterior/posterior spinal fusion approach. Important free-text inputs included vertebral osteomyelitis, radiculopathy, myelopathy, and spinal metastasis.
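
To make the reported metrics concrete, the following minimal sketch computes AUC-ROC, Brier score, precision, recall, and F1-score from predicted probabilities in plain Python. The labels and probabilities are invented toy values, not study data. The 0.5 decision threshold and the choice of the event as the positive class are assumptions, since the abstract states neither. Calibration slope and intercept are omitted because they require fitting a logistic recalibration model.

```python
def auc_roc(y_true, y_prob):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Binarize at an assumed threshold, then compute precision, recall, and F1."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented toy labels (1 = event occurred) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.05]

auc = auc_roc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
precision, recall, f1 = precision_recall_f1(y_true, y_prob)
print(f"AUC={auc:.3f} Brier={brier:.3f} P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

AUC and Brier score are computed directly from the probabilities. Precision, recall, and F1-score depend on the chosen threshold and positive class, so they can shift without any change in the underlying model.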

CONCLUSIONS

The multimodal NLP model exhibited superior performance in all outcome measures when compared to the baseline tabular model. Future work includes incorporating additional model dimensions, such as the history of present illness, physical exam, and spinal imaging, and clinically implementing the models into our informed consent and preoperative optimization pathway.
