Child Health Evaluative Sciences, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, Toronto, ON, Canada.
Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.
BMC Med Res Methodol. 2024 Jul 13;24(1):147. doi: 10.1186/s12874-024-02273-8.
Decision analytic models and meta-analyses often rely on survival probabilities that are digitized from published Kaplan-Meier (KM) curves. However, manually extracting these probabilities from KM curves is time-consuming, expensive, and error-prone. We developed an efficient and accurate algorithm that automates extraction of survival probabilities from KM curves.
The automated digitization algorithm processes images in JPG or PNG format, converts them to the hue, saturation, and lightness (HSL) color scale, and uses optical character recognition to detect axis locations and labels. It also uses a k-medoids clustering algorithm to separate multiple overlapping curves in the same figure. To validate performance, we generated survival plots from random time-to-event data with sample sizes of 25, 50, 150, 250, and 1000 individuals split into 1, 2, or 3 treatment arms. We assumed an exponential distribution and applied random censoring. We compared automated digitization with manual digitization performed by well-trained researchers. We calculated the root mean squared error (RMSE) at 100 time points for both methods. The algorithm's performance was also evaluated by Bland-Altman analysis of the agreement between automated and manual digitization on a real-world set of published KM curves.
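The validation design described above can be illustrated with a minimal Python sketch. It is not the SurvdigitizeR implementation: the function names, the hazard and censoring rates, the seed, and the stand-in "digitized" curve (the reference curve rounded to two decimals to mimic limited pixel resolution) are all assumptions made for illustration. It simulates exponential event times with random censoring, computes the Kaplan-Meier estimate, and evaluates the RMSE on a grid of 100 time points.

```python
import math
import random


def simulate_arm(n, hazard, censor_rate, rng):
    """Simulate n (time, event) pairs; event is 1 if observed, 0 if censored."""
    data = []
    for _ in range(n):
        t_event = rng.expovariate(hazard)        # exponential event time
        t_censor = rng.expovariate(censor_rate)  # independent random censoring time
        data.append((min(t_event, t_censor), 1 if t_event <= t_censor else 0))
    return data


def kaplan_meier(data):
    """Return the KM step function as (time, survival) pairs, starting at (0, 1)."""
    ordered = sorted(data)
    at_risk = len(ordered)
    surv = 1.0
    steps = [(0.0, 1.0)]
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same time (handles ties).
        while i < len(ordered) and ordered[i][0] == t:
            deaths += ordered[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            steps.append((t, surv))
        at_risk -= removed
    return steps


def survival_at(steps, t):
    """Evaluate a right-continuous step function at time t."""
    value = 1.0
    for time, surv in steps:
        if time > t:
            break
        value = surv
    return value


def rmse(reference, estimate, grid):
    """Root mean squared error between two step functions over a time grid."""
    squared = [(survival_at(reference, t) - survival_at(estimate, t)) ** 2 for t in grid]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical settings: one arm of 150 individuals, arbitrary rates and seed.
rng = random.Random(2024)
arm = simulate_arm(150, hazard=0.1, censor_rate=0.05, rng=rng)
reference = kaplan_meier(arm)

# Stand-in for a digitized curve (assumption): survival rounded to 2 decimals.
digitized = [(t, round(s, 2)) for t, s in reference]

max_time = max(t for t, _ in arm)
grid = [max_time * k / 99 for k in range(100)]  # 100 evenly spaced time points
error = rmse(reference, digitized, grid)
```

In an actual validation, `digitized` would be the output of the automated or manual digitization of the plotted curve, compared against the known simulated curve in the same way.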
The automated digitizer accurately identified survival probabilities over time in the simulated KM curves. The average RMSE for automated digitization was 0.012, while manual digitization had an average RMSE of 0.014. The automated digitizer's performance was negatively correlated with the number of curves in a figure and with the presence of censoring markers. In real-world scenarios, automated digitization and manual digitization showed very close agreement.
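For readers unfamiliar with Bland-Altman analysis, the agreement between two digitization methods is summarized by the mean difference (bias) and the 95% limits of agreement, conventionally bias ± 1.96 standard deviations of the paired differences. The sketch below is illustrative only: the function name and the paired survival values are hypothetical and are not data from the study.

```python
import statistics


def bland_altman(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements."""
    if len(method_a) != len(method_b):
        raise ValueError("paired measurements must have equal length")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired survival probabilities read at the same time points.
automated = [0.5, 0.6, 0.7, 0.8]
manual = [0.4, 0.6, 0.6, 0.8]
bias, lower, upper = bland_altman(automated, manual)
```

A bias near zero with narrow limits of agreement indicates that the two methods can be used interchangeably for practical purposes.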
The algorithm streamlines the digitization process and requires minimal user input. It effectively digitized KM curves in simulated and real-world scenarios, demonstrating accuracy comparable to conventional manual digitization. The algorithm has been developed as an open-source R package and as a Shiny application and is available on GitHub: https://github.com/Pechli-Lab/SurvdigitizeR and https://pechlilab.shinyapps.io/SurvdigitizeR/ .