Using a national surgical database to predict complications following posterior lumbar surgery and comparing the area under the curve and F1-score for the assessment of prognostic capability.

Affiliations

Ottawa Spine Collaborative Analytics Network, The Ottawa Hospital, Ottawa, ON, Canada K1Y 4E9.

Ottawa Spine Collaborative Analytics Network, The Ottawa Hospital, Ottawa, ON, Canada K1Y 4E9; Ottawa Hospital Research Institute, Ottawa, ON, Canada K1Y 4E9.

Publication information

Spine J. 2021 Jul;21(7):1135-1142. doi: 10.1016/j.spinee.2021.02.007. Epub 2021 Feb 16.

Abstract

BACKGROUND

With spinal surgery rates increasing in North America, models that can accurately predict which patients are at greater risk of developing complications are highly warranted. However, previously published methods that used large, multi-centre databases to develop prediction models have relied on the receiver operating characteristic (ROC) curve and its associated area under the curve (AUC) to assess model performance. Recently, it has been found that a precision-recall curve with the associated F1-score could provide a more realistic analysis for these models.

PURPOSE

To develop a logistic regression (LR) model for the prediction of complications following posterior lumbar spine surgery, and then to assess any difference in the measured performance of the model when using the AUC versus the F1-score.

STUDY DESIGN

Retrospective review of a prospective cohort.

PATIENT SAMPLE

The American College of Surgeons National Surgical Quality Improvement Program (NSQIP) registry was used. All patients who underwent posterior lumbar spine surgery between 2005 and 2016 with appropriate data were included.

OUTCOME MEASURES

Both the AUC and F1-score were utilized to assess the prognostic performance of the prediction model.
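For reference, the F1-score is the harmonic mean of precision and recall. The standard definitions are given below in terms of true positives (TP), false positives (FP), and false negatives (FN); the abstract does not state the classification threshold at which these counts were taken, so none is assumed here.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```

True negatives do not appear in any of these expressions, which is why F1 behaves differently from ROC-based measures when negatives dominate the dataset.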

METHODS

To develop the LR model used to predict a complication during or following spine surgery, 19 variables were selected from the NSQIP registry by three orthopedic spine surgeons. Two datasets were developed for this analysis: (1) an imbalanced dataset, which was taken directly from the NSQIP registry, and (2) a down-sampled set. The purpose of the down-sampled set was to balance the data in order to evaluate whether balancing had an effect on model performance. The AUC and F1-score were applied to both of these datasets.
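The abstract does not describe how the down-sampled set was built. A minimal sketch of one common approach, random under-sampling of the majority class, might look like the following. The function name `downsample_majority`, the seeds, and the synthetic labels are all assumptions made for illustration; none of it is taken from the study or from the NSQIP registry.

```python
import numpy as np


def downsample_majority(y, seed=0):
    """Return sorted row indices that keep every minority-class case and an
    equally sized random subset of the majority class (hypothetical helper)."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Decide which class is the minority so the helper works either way.
    if len(pos_idx) <= len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    # Sample the majority class without replacement down to the minority size.
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    return np.sort(np.concatenate([minority, kept_majority]))


# Synthetic labels only: 1 = complication, 0 = none, roughly 10% positives.
label_rng = np.random.default_rng(42)
y = (label_rng.random(5000) < 0.10).astype(int)

keep = downsample_majority(y)
y_balanced = y[keep]
print(len(y), int(y.sum()), len(y_balanced), int(y_balanced.sum()))
```

The returned indices can be used to subset the feature matrix in the same way as the labels, so that the model is refit and re-evaluated on the balanced rows.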

RESULTS

Within the NSQIP database, 52,787 spine surgery cases were identified, of which only 10% had complications during surgery. Applying the LR model showed a large difference between the AUC (0.69) and the F1-score (0.075) on the imbalanced dataset. However, no major differences existed between the AUC and F1-score when the data were balanced and the LR model was reapplied (0.69 and 0.62, AUC and F1-score, respectively).

CONCLUSIONS

The F1-score detected a drastically lower performance for the prediction of complications when using the imbalanced data, but detected a performance similar to the AUC when balancing techniques were applied to the dataset. This difference is due to a low precision score when many false positive classifications are present, which the AUC value does not identify. This lowers the utility of the AUC score, as many of the datasets used in medicine are imbalanced. Therefore, we recommend using the F1-score on large, prospective databases when the data are imbalanced with a large number of true negative classifications.
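The dependence of precision on class balance can be illustrated with simple arithmetic. The sketch below holds sensitivity and specificity fixed, so a single point on the ROC curve is unchanged, and varies only the proportion of positive cases. The values 0.60 and 0.70, the sample size of 1,000, and the function names are hypothetical choices for illustration; they are not the study's operating point and do not reproduce its reported F1-scores.

```python
def confusion_counts(n, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts for a classifier with fixed
    sensitivity and specificity applied to n cases."""
    positives = n * prevalence
    negatives = n - positives
    tp = sensitivity * positives
    fn = positives - tp
    tn = specificity * negatives
    fp = negatives - tn
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


results = {}
for prevalence in (0.10, 0.50):
    # Hypothetical operating point: sensitivity 0.60, specificity 0.70.
    tp, fp, fn, tn = confusion_counts(1000, prevalence, 0.60, 0.70)
    results[prevalence] = precision_recall_f1(tp, fp, fn)
    precision, recall, f1 = results[prevalence]
    print(f"prevalence={prevalence:.2f}  precision={precision:.2f}  "
          f"recall={recall:.2f}  F1={f1:.2f}")
```

Under these assumed values, recall stays at 0.60 in both cases, while precision rises from about 0.18 at 10% prevalence to about 0.67 at 50%, and F1 rises from about 0.28 to about 0.63. The true positive and false positive rates are identical in the two scenarios, so a ROC-based measure would not register the change.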

