Mathematical and Computational Science Program, Stanford University, Stanford, CA, USA.
Department of Medicine, Stanford University, Stanford, CA, USA.
J Biomed Inform. 2018 Oct;86:109-119. doi: 10.1016/j.jbi.2018.09.005. Epub 2018 Sep 7.
Evaluate the quality of clinical order practice patterns machine-learned from clinician cohorts stratified by patient mortality outcomes.
Inpatient electronic health records from 2010 to 2013 were extracted from a tertiary academic hospital. Clinicians (n = 1822) were stratified into low-mortality (21.8%, n = 397) and high-mortality (6.0%, n = 110) extremes using a two-sided P-value score quantifying the deviation of observed vs. expected 30-day patient mortality rates. Three patient cohorts were assembled: patients seen by low-mortality clinicians, by high-mortality clinicians, and by an unfiltered crowd of all clinicians (n = 1046, 1046, and 5230 after propensity score matching, respectively). Predicted order lists were automatically generated from recommender system algorithms trained on each patient cohort and evaluated against (i) real-world practice patterns reflected in patient cases with better-than-expected mortality outcomes and (ii) reference standards derived from clinical practice guidelines.
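As a rough illustration of the stratification step, a clinician's deviation score can be framed as a two-sided P value comparing observed 30-day deaths with the count expected from patient-level risk. The sketch below is a minimal, assumed reconstruction (the function name, the use of SciPy's exact binomial test, and the per-patient expected-risk inputs are illustrative assumptions, not the authors' code):

```python
from scipy.stats import binomtest  # exact two-sided binomial test

def mortality_deviation_pvalue(observed_deaths: int,
                               expected_probs: list[float]) -> float:
    """Two-sided P value for a clinician's observed vs. expected
    30-day mortality. `expected_probs` holds each treated patient's
    predicted probability of death (hypothetical input; the study's
    actual expectation model may differ)."""
    n_patients = len(expected_probs)
    expected_rate = sum(expected_probs) / n_patients
    return binomtest(observed_deaths, n_patients,
                     expected_rate, alternative="two-sided").pvalue

# Example: 8 deaths among 200 patients with ~2% mean expected risk
# yields a small P value, flagging a high-mortality outlier.
print(mortality_deviation_pvalue(8, [0.02] * 200))
```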
Across six common admission diagnoses, order lists learned from the crowd demonstrated the greatest alignment with guideline references (AUROC range = 0.86-0.91), performing on par with or better than those learned from low-mortality clinicians (0.79-0.84, P < 10) or manually authored hospital order sets (0.65-0.77, P < 10). The same trend was observed when evaluating model predictions against better-than-expected patient cases: the crowd model (mean AUROC = 0.91) outperformed the low-mortality model (0.87, P < 10) and the order set benchmark (0.78, P < 10).
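For context, the guideline-alignment metric can be reproduced in spirit by scoring each candidate order with a trained model and computing AUROC against a binary guideline reference. This is a sketch under assumed variable names and toy values; the study's actual scoring and reference-standard construction are more involved:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical example: model scores for candidate orders and a binary
# label marking whether each order appears in the guideline reference.
candidate_orders = ["aspirin", "troponin", "head CT", "blood culture"]
model_scores     = [0.92,      0.88,       0.10,      0.35]
in_guideline     = [1,         1,          0,         0]

auroc = roc_auc_score(in_guideline, model_scores)
print(f"AUROC vs. guideline reference: {auroc:.2f}")
```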
Whether machine-learning models should be trained on all clinicians or on a subset of outcome-selected experts illustrates a bias-variance tradeoff in data usage. Defining robust quality metrics based on internal reference standards (e.g., practice patterns from better-than-expected patient cases) or external ones (e.g., clinical practice guidelines) is critical for assessing decision support content.
Learning relevant decision support content from all clinicians is at least as robust as, and possibly more robust than, learning from a select subgroup of clinicians favored by patient outcomes.