State Key Laboratory of Ophthalmology, Clinical Research Center for Ocular Disease, Zhongshan Ophthalmic Centre, Sun Yat-sen University, Guangzhou, China.
School of Public Health, Sun Yat-sen University, Guangzhou, China.
PLoS Med. 2018 Nov 6;15(11):e1002674. doi: 10.1371/journal.pmed.1002674. eCollection 2018 Nov.
Electronic medical records provide large-scale real-world clinical data for use in developing clinical decision systems. However, sophisticated methodology and analytical skills are required to handle the large-scale datasets necessary for the optimisation of prediction accuracy. Myopia is a common cause of vision loss. Current approaches to control myopia progression are effective but have significant side effects. Therefore, identifying those at greatest risk who should undergo targeted therapy is of great clinical importance. The objective of this study was to apply big data and machine learning technology to develop an algorithm that can predict the onset of high myopia, at specific future time points, among Chinese school-aged children.
Real-world clinical refraction data were derived from electronic medical record systems in 8 ophthalmic centres from January 1, 2005, to December 30, 2015. The variables of age, spherical equivalent (SE), and annual progression rate were used to develop an algorithm to predict SE and onset of high myopia (SE ≤ -6.0 dioptres) up to 10 years in the future. Random forest machine learning was used for algorithm training and validation. Electronic medical records from the Zhongshan Ophthalmic Centre (a major tertiary ophthalmic centre in China) were used as the training set. Ten-fold cross-validation and out-of-bag (OOB) methods were applied for internal validation. The remaining 7 independent datasets were used for external validation. Two population-based datasets, which had no participant overlap with the ophthalmic-centre-based datasets, were used for multi-resource validation testing. The main outcomes and measures were the area under the curve (AUC) values for predicting the onset of high myopia over 10 years and the presence of high myopia at 18 years of age. In total, 687,063 multiple visit records (≥3 records) of 129,242 individuals in the ophthalmic-centre-based electronic medical record databases and 17,113 follow-up records of 3,215 participants in population-based cohorts were included in the analysis. Our algorithm accurately predicted the presence of high myopia in internal validation (the AUC ranged from 0.903 to 0.986 for 3 years, 0.875 to 0.901 for 5 years, and 0.852 to 0.888 for 8 years), external validation (the AUC ranged from 0.874 to 0.976 for 3 years, 0.847 to 0.921 for 5 years, and 0.802 to 0.886 for 8 years), and multi-resource testing (the AUC ranged from 0.752 to 0.869 for 4 years). 
With respect to the prediction of high myopia development by 18 years of age, as a surrogate of high myopia in adulthood, the algorithm provided clinically acceptable accuracy over 3 years (the AUC ranged from 0.940 to 0.985), 5 years (the AUC ranged from 0.856 to 0.901), and even 8 years (the AUC ranged from 0.801 to 0.837). The algorithm also achieved clinically acceptable prediction of the actual refraction values at future time points, as supported by its regression performance and calibration curves. Although the algorithm achieved balanced and robust performance, the compromised quality of real-world clinical data and the risk of over-fitting should be carefully considered.
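As a rough illustration of the quantities named above, the sketch below derives an annual progression rate from a child's visit records, applies the SE ≤ -6.0 dioptre definition of high myopia, and computes an AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function names, the first-to-last-visit definition of progression rate, and the sample values are assumptions made for illustration only; the abstract does not specify how the study computed the rate, and this is not the random forest model itself.

```python
from typing import List, Tuple

# SE <= -6.0 dioptres is the high myopia definition given in the abstract.
HIGH_MYOPIA_THRESHOLD = -6.0


def annual_progression_rate(visits: List[Tuple[float, float]]) -> float:
    """Change in SE per year between the earliest and latest visit.

    `visits` holds (age_in_years, spherical_equivalent) pairs.
    Assumption: a simple first-to-last slope, not the study's definition.
    """
    ordered = sorted(visits)  # sort by age
    (age_first, se_first), (age_last, se_last) = ordered[0], ordered[-1]
    if age_last == age_first:
        raise ValueError("visits must span a non-zero time interval")
    return (se_last - se_first) / (age_last - age_first)


def is_high_myopia(se: float) -> bool:
    """True when the spherical equivalent meets the high myopia threshold."""
    return se <= HIGH_MYOPIA_THRESHOLD


def auc(labels: List[int], scores: List[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of records,
    so suitable only for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical visit records for one child: (age, SE in dioptres).
    visits = [(8.0, -1.0), (9.0, -1.75), (10.0, -2.5)]
    print(annual_progression_rate(visits))
    print(is_high_myopia(-6.0), is_high_myopia(-5.75))
    # Hypothetical labels (1 = developed high myopia) and predicted risks.
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In practice a library routine would replace the pairwise loop for datasets of the size reported here; the explicit form is shown only because it makes the meaning of an AUC value such as 0.874 easy to read off.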
To our knowledge, this study is the first to use large-scale data collected from electronic health records to demonstrate that big data and machine learning approaches can improve the prediction of myopia prognosis in Chinese school-aged children. This work provides evidence for transforming clinical practice, health policy-making, and precise individualised interventions regarding the practical control of school-aged myopia.