Department of Biomedical and Health Informatics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
Pediatr Blood Cancer. 2023 May;70(5):e30260. doi: 10.1002/pbc.30260. Epub 2023 Feb 23.
Administrative datasets are useful for identifying rare disease cohorts such as pediatric acute myeloid leukemia (AML). Previously, cohorts were assembled using labor-intensive, manual reviews of patients' longitudinal chemotherapy data.
We used a two-step machine learning (ML) method to (i) identify pediatric patients with newly diagnosed AML in an administrative/billing database, and (ii) identify the chemotherapy courses of those patients. From a study sample of 2558 patients who had previously been manually reviewed, multiple ML algorithms were derived using 75% of the sample, and the selected model was tested in the remaining hold-out sample. The selected model was also applied to assemble a new pediatric AML cohort and was further assessed in an external validation against a standalone cohort established by manual chart abstraction.
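The 75%/25% derivation and hold-out design described above can be sketched as follows. This is a minimal illustration only: it assumes the split is made at the patient level with a simple random shuffle, which the abstract does not state, and the function name `holdout_split`, the seed, and the placeholder patient IDs are all hypothetical.

```python
import random


def holdout_split(patient_ids, train_fraction=0.75, seed=0):
    """Shuffle patient IDs and split them into derivation and hold-out sets.

    Assumption: splitting is done per patient, so that all records for one
    patient fall on the same side of the split.
    """
    ids = sorted(set(patient_ids))  # de-duplicate so no patient lands in both sets
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Hypothetical IDs standing in for the 2558 manually reviewed patients.
patient_ids = [f"P{i:04d}" for i in range(2558)]
train_ids, test_ids = holdout_split(patient_ids)
print(len(train_ids), len(test_ids))
```

Candidate models would be fitted and compared only on `train_ids`; `test_ids` would be touched once, to evaluate the selected model.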
For patient identification, the selected Support Vector Machine model yielded a sensitivity of 0.97 and a positive predictive value (PPV) of 0.97 in the hold-out test sample. For course-specific chemotherapy regimen and start date identification, the selected Random Forest model yielded PPV ≥ 0.88 and sensitivity ≥ 0.86 across all courses in the test sample. When applied to new cohort assembly, the ML approach identified 3016 AML patients with 10,588 treatment courses. In the external validation subset, PPV was ≥ 0.75 and sensitivity was ≥ 0.82 for patient identification, and PPV was ≥ 0.93 and sensitivity was ≥ 0.94 for regimen identification.
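For reference, the two metrics reported above can be computed by comparing model-identified patients against a manually reviewed reference set: sensitivity is TP / (TP + FN) and PPV is TP / (TP + FP). The sketch below uses made-up toy identifiers, not study data, and the function name `sensitivity_and_ppv` is hypothetical.

```python
def sensitivity_and_ppv(predicted, reference):
    """Return (sensitivity, PPV) for a set of predicted positives.

    predicted: IDs the model flagged as AML
    reference: IDs confirmed as AML by manual review (the reference standard)
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # flagged and truly AML
    fp = len(predicted - reference)   # flagged but not AML
    fn = len(reference - predicted)   # truly AML but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Toy example (illustrative only): 3 true positives, 1 miss, 1 false alarm.
reference = {"A", "B", "C", "D"}
predicted = {"A", "B", "C", "E"}
sens, ppv = sensitivity_and_ppv(predicted, reference)
print(sens, ppv)  # 0.75 0.75
```

The same function would apply to course-level evaluation if each element were a (patient, course) pair rather than a patient ID.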
A carefully designed ML model can accurately identify pediatric AML patients and their chemotherapy courses from administrative databases. This approach may be generalizable to other diseases and databases.