Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, California.
Division of Hospital Medicine, Department of Medicine, Stanford University, Stanford, California.
JAMA Netw Open. 2019 Sep 4;2(9):e1910967. doi: 10.1001/jamanetworkopen.2019.10967.
IMPORTANCE: Laboratory testing is an important target for high-value care initiatives because it constitutes the highest volume of medical procedures. Prior studies have found that up to half of all inpatient laboratory tests may be medically unnecessary, but a systematic method to identify these unnecessary tests in individual cases is lacking.
OBJECTIVE: To systematically identify low-yield inpatient laboratory testing through personalized predictions.
DESIGN, SETTING, AND PARTICIPANTS: This retrospective diagnostic study with multivariable prediction models assessed 116 637 inpatients treated at Stanford University Hospital from January 1, 2008, to December 31, 2017; 60 929 inpatients treated at University of Michigan from January 1, 2015, to December 31, 2018; and 13 940 inpatients treated at the University of California, San Francisco from January 1 to December 31, 2018.
MAIN OUTCOMES AND MEASURES: Diagnostic accuracy measures, including sensitivity, specificity, negative predictive values (NPVs), positive predictive values (PPVs), and area under the receiver operating characteristic curve (AUROC), of machine learning models when predicting whether inpatient laboratory tests yield a normal result as defined by local laboratory reference ranges.
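The accuracy measures named above can be illustrated with a minimal Python sketch. It is not the study's code: the function names, the toy labels and scores, and the 0.5 threshold are hypothetical, and it assumes the positive class is "result is normal", which the abstract implies but does not state outright.

```python
from typing import Dict, Sequence


def diagnostic_metrics(y_true: Sequence[int], y_score: Sequence[float],
                       threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV at one decision threshold.

    Assumption: y_true == 1 means the test result was normal, and a score
    at or above `threshold` is a prediction of "normal".
    """
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted_normal = score >= threshold
        if truth and predicted_normal:
            tp += 1
        elif truth and not predicted_normal:
            fn += 1
        elif not truth and predicted_normal:
            fp += 1
        else:
            tn += 1

    def ratio(num: int, den: int) -> float:
        # Undefined ratios are reported as NaN rather than raising.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outranks a random
    negative, with ties counted as one half (pairwise comparison)."""
    pos = [s for t, s in zip(y_true, y_score) if t]
    neg = [s for t, s in zip(y_true, y_score) if not t]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not from the study).
toy_truth = [1, 1, 1, 0, 0, 1, 0, 0]
toy_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_metrics = diagnostic_metrics(toy_truth, toy_score, threshold=0.5)
toy_auroc = auroc(toy_truth, toy_score)
```

AUROC does not depend on the threshold, whereas the other four measures do. Lowering the threshold raises sensitivity at the cost of specificity, and PPV and NPV also shift with how common normal results are in the population.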
RESULTS: In the recent data sets (July 1, 2014, to June 30, 2017) from Stanford University Hospital (including 22 664 female inpatients with a mean [SD] age of 58.8 [19.0] years and 22 016 male inpatients with a mean [SD] age of 59.0 [18.1] years), among the top 20 highest-volume tests, 792 397 were repeats of orders within 24 hours, including tests that are physiologically unlikely to yield new information that quickly (eg, white blood cell differential, glycated hemoglobin, and serum albumin level). The best-performing machine learning models predicted normal results with an AUROC of 0.90 or greater for 12 stand-alone laboratory tests (eg, sodium AUROC, 0.92 [95% CI, 0.91-0.93]; sensitivity, 98%; specificity, 35%; PPV, 66%; NPV, 93%; lactate dehydrogenase AUROC, 0.93 [95% CI, 0.93-0.94]; sensitivity, 96%; specificity, 65%; PPV, 71%; NPV, 95%; and troponin I AUROC, 0.92 [95% CI, 0.91-0.93]; sensitivity, 88%; specificity, 79%; PPV, 67%; NPV, 93%) and 10 common laboratory test components (eg, hemoglobin AUROC, 0.94 [95% CI, 0.92-0.95]; sensitivity, 99%; specificity, 17%; PPV, 90%; NPV, 81%; creatinine AUROC, 0.96 [95% CI, 0.96-0.97]; sensitivity, 93%; specificity, 83%; PPV, 79%; NPV, 94%; and urea nitrogen AUROC, 0.95 [95% CI, 0.94-0.96]; sensitivity, 87%; specificity, 89%; PPV, 77%; NPV, 94%).
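One way to tally repeat orders within 24 hours is sketched below in Python. This is an illustration under stated assumptions, not the authors' method: the abstract does not define a repeat precisely, so the sketch counts an order as a repeat when the same test was ordered for the same patient no more than 24 hours earlier. The record layout, the function name, and the example orders are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (patient_id, test_name, order_time).
Order = Tuple[str, str, datetime]


def count_repeats_within(orders: Iterable[Order],
                         window: timedelta = timedelta(hours=24)) -> Dict[str, int]:
    """Count, per test, the orders placed within `window` of the previous
    order of the same test for the same patient (assumed definition)."""
    times_by_key = defaultdict(list)
    for patient_id, test_name, ordered_at in orders:
        times_by_key[(patient_id, test_name)].append(ordered_at)

    repeats: Dict[str, int] = defaultdict(int)
    for (_, test_name), times in times_by_key.items():
        times.sort()
        # Compare each order with the one immediately before it.
        for previous, current in zip(times, times[1:]):
            if current - previous <= window:
                repeats[test_name] += 1
    return dict(repeats)


# Made-up orders for illustration only.
example_orders = [
    ("p1", "sodium", datetime(2017, 1, 1, 6)),
    ("p1", "sodium", datetime(2017, 1, 1, 18)),   # 12 h after previous: repeat
    ("p1", "sodium", datetime(2017, 1, 3, 6)),    # 36 h after previous: not a repeat
    ("p2", "sodium", datetime(2017, 1, 1, 8)),    # different patient
    ("p1", "albumin", datetime(2017, 1, 1, 6)),
    ("p1", "albumin", datetime(2017, 1, 2, 5)),   # 23 h after previous: repeat
]
example_repeats = count_repeats_within(example_orders)
```

Grouping by patient and test before sorting keeps each comparison local to one patient's history for one test, so orders for other patients or other tests never count as repeats.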
CONCLUSIONS AND RELEVANCE: The findings suggest that low-yield diagnostic testing is common and can be systematically identified through data-driven methods and patient context-aware predictions. Machine learning models appear to be able to quantify explicitly the level of uncertainty and the expected information gained from diagnostic tests, with the potential to encourage useful testing and discourage low-value testing that incurs direct costs and indirect harms.