Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA.
University of Cincinnati College of Medicine, Cincinnati, OH 45229, USA.
HGG Adv. 2024 Oct 10;5(4):100341. doi: 10.1016/j.xhgg.2024.100341. Epub 2024 Aug 14.
Rare genetic diseases (RGDs) affect a significant number of individuals, particularly in pediatric populations. This study investigates how effectively RGD diagnoses can be identified from electronic health records (EHRs) using natural language processing (NLP) tools, and analyzes the prevalence of the identified RGDs for signs of potential underdiagnosis at Cincinnati Children's Hospital Medical Center (CCHMC). EHR data from 659,139 pediatric patients at CCHMC were used. Diagnoses corresponding to RGDs in Orphanet were identified using rule-based and machine-learning-based NLP methods. The precision of each NLP strategy was assessed manually, with 100 diagnosis descriptions reviewed per method. The rule-based method achieved a precision of 97.5% (95% CI: 91.5%, 99.4%), while the machine-learning-based method achieved a precision of 73.5% (95% CI: 63.6%, 81.6%). A manual chart review of 70 randomly selected patients with RGD diagnoses confirmed the diagnoses in 90.3% (95% CI: 82.0%, 95.2%) of cases. The rule-based method identified 37,326 pediatric patients with 977 distinct RGD diagnoses, a prevalence of 5.66% in this population. While a majority of the disorders showed a higher prevalence at CCHMC than in Orphanet, some diseases, such as 1p36 deletion syndrome, showed a lower prevalence, indicating potential underdiagnosis. Analyses further uncovered disparities in RGD prevalence and age of diagnosis across gender and racial groups. This study demonstrates the utility of combining EHR data with NLP tools to systematically investigate RGD diagnoses in large cohorts. The identified disparities underscore the need for improved approaches to ensure timely and accurate diagnosis and management of pediatric RGDs.
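The reported prevalence follows directly from the two counts given in the abstract. The sketch below reproduces that arithmetic and, purely for illustration, attaches a Wilson score interval to it. The abstract does not say which interval method the authors used for their confidence intervals, so the choice of Wilson here is an assumption, and the function names `prevalence` and `wilson_interval` are hypothetical helpers, not part of the study's pipeline.

```python
import math


def prevalence(cases: int, population: int) -> float:
    """Return the proportion of the population counted as cases."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the source does not state its interval method; Wilson is
    used here only as a common choice for proportions.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts taken from the abstract.
rgd_patients = 37_326   # patients with at least one identified RGD diagnosis
cohort_size = 659_139   # pediatric patients with EHR data at CCHMC

p = prevalence(rgd_patients, cohort_size)
low, high = wilson_interval(rgd_patients, cohort_size)

print(f"Prevalence: {p:.2%}")
print(f"Illustrative 95% Wilson interval: {low:.2%} to {high:.2%}")
```

With a cohort this large the interval is narrow, so sampling uncertainty in the prevalence estimate is small relative to the uncertainty introduced by the precision of the NLP matching itself.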