Yadaw Arjun S, Sid Eric, Sidky Hythem, Zeng Chenjie, Zhu Qian, Mathé Ewy A
Division of Preclinical Innovation, National Center for Advancing Translational Sciences (NCATS), NIH, Rockville, MD, USA.
Division of Rare Diseases Research Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, Bethesda, MD 20892, USA.
medRxiv. 2025 May 6:2025.05.02.25325348. doi: 10.1101/2025.05.02.25325348.
Identifying rare disease (RD) patients in electronic health records (EHR) is challenging because the more than 10,000 rare diseases are typically not captured by clinical coding systems. This limits the assessment of clinical outcomes for RD patients. This study introduces a semiautomated approach for mapping RDs to appropriate codes that is applicable across various EHR systems. By improving RD patient identification, the method facilitates the analysis of clinical outcomes and disease severity in the RD population. We illustrate this using large EHR datasets such as the National COVID Cohort Collaborative (N3C), which includes over 21 million patients.
We developed a semiautomated workflow to enumerate RD-specific SNOMED-CT and ICD-10 codes, starting with 12,003 GARD IDs mapped to Orphanet. The workflow linked RDs to SNOMED-CT and ICD-10 codes and applied exclusion criteria based on groups of disorders. We then built an extensive list of SNOMED-CT codes, including their descendants from OHDSI ATLAS, and applied phenotype filtering to remove irrelevant codes. The final list comprised 12,081 SNOMED-CT codes and 357 ICD-10 codes for further analysis, enabling the identification and mapping of rare diseases in EHR.
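The code-enumeration steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every identifier and table is a made-up toy value, the function names are hypothetical, and the exclusion step assumes that "groups of disorders" means Orphanet entries flagged as groups are dropped. Only the SNOMED-CT branch is shown; an ICD-10 branch would follow the same pattern without descendant expansion.

```python
# Toy lookup tables. All IDs are invented for illustration only.
GARD_TO_ORPHA = {
    "GARD:0000001": "ORPHA:101",
    "GARD:0000002": "ORPHA:102",
    "GARD:0000003": "ORPHA:103",
}
ORPHA_IS_GROUP = {"ORPHA:101": False, "ORPHA:102": True, "ORPHA:103": False}
ORPHA_TO_SNOMED = {"ORPHA:101": ["S1"], "ORPHA:102": ["S2"], "ORPHA:103": ["S3"]}
SNOMED_CHILDREN = {"S1": ["S1a", "S1b"], "S1a": ["S1a1"], "S3": ["S3a"]}
PHENOTYPE_CODES = {"S1b"}  # assumed: codes judged to be phenotypes, not diseases


def descendants(code, children):
    """Return the code together with all of its descendants in the hierarchy."""
    seen = set()
    stack = [code]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children.get(current, []))
    return seen


def build_code_list(gard_to_orpha, is_group, orpha_to_snomed, children, phenotypes):
    """Map each GARD ID to a filtered, sorted list of SNOMED-CT codes."""
    result = {}
    for gard_id, orpha_id in gard_to_orpha.items():
        if is_group.get(orpha_id, False):
            continue  # exclusion step: skip entries that are groups of disorders
        codes = set()
        for snomed in orpha_to_snomed.get(orpha_id, []):
            codes |= descendants(snomed, children)  # expand with descendants
        codes -= phenotypes  # phenotype filtering
        if codes:
            result[gard_id] = sorted(codes)
    return result


code_list = build_code_list(
    GARD_TO_ORPHA, ORPHA_IS_GROUP, ORPHA_TO_SNOMED, SNOMED_CHILDREN, PHENOTYPE_CODES
)
```

In this toy run, the entry flagged as a group is excluded and the phenotype code is removed from the expanded descendant set, mirroring the order of steps in the paragraph above.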
Our semiautomated workflow identified 357 RD-specific ICD-10 codes and 12,081 SNOMED-CT codes representing 6,342 RDs, which are categorized into 30 Orphanet linearization classes. We illustrate the utility of these codes with a preliminary univariate analysis of COVID-19 outcomes in a large cohort of 4,835,718 COVID-19-positive individuals in N3C, of whom 404,735 (8.37%) were identified as having a preexisting RD. Across RD classes, the risk ratios ranged from 0.23 to 5.28 for mortality and from 0.93 to 3.13 for hospitalization (p-values < 0.001).
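For readers unfamiliar with the measure, a risk ratio compares the proportion of patients with an outcome in one group against the proportion in a reference group. The sketch below is illustrative only: the `risk_ratio` helper and the toy counts are hypothetical, and the choice of reference group (for example, patients without an RD) is an assumption not stated in this abstract. The only figures taken from the text are the two cohort sizes, used to check the reported 8.37%.

```python
def risk_ratio(events_group, n_group, events_reference, n_reference):
    """Unadjusted risk ratio: risk in the group divided by risk in the reference."""
    risk_group = events_group / n_group
    risk_reference = events_reference / n_reference
    return risk_group / risk_reference


# Share of the COVID-19-positive cohort with a preexisting RD (figures from the text).
rd_share = 404_735 / 4_835_718
print(f"{rd_share:.2%}")  # prints 8.37%

# Hypothetical toy counts, NOT study data: 30 deaths among 1,000 patients in one
# RD class versus 10 deaths among 1,000 reference patients.
toy_rr = risk_ratio(30, 1_000, 10, 1_000)
```

A ratio above 1 indicates higher risk in the RD class than in the reference group, and a ratio below 1 indicates lower risk, which is how the reported ranges should be read.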
Our systematic, semiautomated workflow enables rapid identification of rare disease patients across diverse EHR systems. We demonstrate its utility by evaluating COVID-19 severity outcomes by rare disease class in the N3C cohort. These findings support the need for targeted preventive healthcare interventions and highlight the potential for future research on long COVID, COVID-19 reinfection, and other outcomes in the rare disease population.