Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States.
J Am Med Inform Assoc. 2024 Feb 16;31(3):640-650. doi: 10.1093/jamia/ocad226.
High-throughput phenotyping will accelerate the use of electronic health records (EHRs) for translational research. A critical roadblock is the extensive medical supervision required for phenotyping algorithm (PA) estimation and evaluation. To address this challenge, numerous weakly-supervised learning methods have been proposed. However, there is a paucity of methods for reliably evaluating the predictive performance of PAs when a very small proportion of the data is labeled. To fill this gap, we introduce a semi-supervised approach (ssROC) for estimation of the receiver operating characteristic (ROC) parameters of PAs (eg, sensitivity, specificity).
ssROC uses a small labeled dataset to nonparametrically impute missing labels. The imputations are then used for ROC parameter estimation to yield more precise estimates of PA performance relative to classical supervised ROC analysis (supROC) using only labeled data. We evaluated ssROC with synthetic, semi-synthetic, and EHR data from Mass General Brigham (MGB).
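The two steps described above (impute, then estimate) can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian-kernel Nadaraya-Watson smoother, the fixed bandwidth, the simulated scores, the use of imputations for all records, and every function name below are assumptions chosen only for illustration.

```python
import numpy as np

def nw_impute(s_lab, y_lab, s_all, bandwidth):
    """Nadaraya-Watson estimate of P(Y = 1 | S = s) using a Gaussian kernel.

    s_lab, y_lab: PA scores and gold-standard labels for the labeled set.
    s_all: PA scores for every record whose label is to be imputed.
    """
    diffs = (s_all[:, None] - s_lab[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    return weights @ y_lab / weights.sum(axis=1)

def ss_tpr_fpr(s_all, m_hat, cutoff):
    """Semi-supervised TPR and FPR: imputed probabilities stand in for labels."""
    pred_pos = s_all >= cutoff
    tpr = np.sum(m_hat * pred_pos) / np.sum(m_hat)
    fpr = np.sum((1 - m_hat) * pred_pos) / np.sum(1 - m_hat)
    return tpr, fpr

def sup_tpr_fpr(s_lab, y_lab, cutoff):
    """Supervised TPR and FPR computed from the labeled records only."""
    pred_pos = s_lab >= cutoff
    tpr = np.sum(y_lab * pred_pos) / np.sum(y_lab)
    fpr = np.sum((1 - y_lab) * pred_pos) / np.sum(1 - y_lab)
    return tpr, fpr

# Hypothetical simulated data: 5000 records, of which the first 200 are labeled.
rng = np.random.default_rng(0)
n_total, n_labeled = 5000, 200
y = rng.binomial(1, 0.3, size=n_total)
# Scores in (0, 1): a logistic transform of a noisy signal shifted by the label.
s = 1.0 / (1.0 + np.exp(-rng.normal(loc=2.0 * y - 1.0, scale=1.0)))

s_lab, y_lab = s[:n_labeled], y[:n_labeled]

# Step 1: impute label probabilities for all records from the labeled subset.
m_hat = nw_impute(s_lab, y_lab, s, bandwidth=0.1)

# Step 2: estimate ROC parameters at one cutoff, with and without imputation.
ss_tpr, ss_fpr = ss_tpr_fpr(s, m_hat, cutoff=0.5)
sup_tpr, sup_fpr = sup_tpr_fpr(s_lab, y_lab, cutoff=0.5)
print(f"semi-supervised: TPR={ss_tpr:.3f}, FPR={ss_fpr:.3f}")
print(f"supervised:      TPR={sup_tpr:.3f}, FPR={sup_fpr:.3f}")
```

Sensitivity is the TPR and specificity is 1 minus the FPR. Repeating the simulation over many random draws and comparing the spread of the two sets of estimates would give a rough picture of the variance comparison the abstract reports; a bandwidth chosen by cross-validation would be a more careful choice than the fixed value used here.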
ssROC produced ROC parameter estimates with minimal bias and significantly lower variance than supROC in the simulated and semi-synthetic data. For the 5 PAs from MGB, the ssROC estimates were, on average, 30% to 60% less variable than those from supROC.
ssROC enables precise evaluation of PA performance without demanding large volumes of labeled data. ssROC is also easily implementable in open-source R software.
When used in conjunction with weakly-supervised PAs, ssROC facilitates the reliable and streamlined phenotyping necessary for EHR-based research.