Center for Precision Medicine, Seoul National University Hospital, Seoul, Korea.
Division of Biomedical Informatics, Seoul National University Biomedical Informatics and Systems Biomedical Informatics Research Center, Seoul National University College of Medicine, Seoul, Korea.
J Korean Med Sci. 2020 Mar 30;35(12):e78. doi: 10.3346/jkms.2020.35.e78.
Human leukocyte antigen (HLA) typing is important for transplant patients to prevent a severe mismatch reaction, and the results can also support the diagnosis of various diseases or the prediction of drug side effects. However, such secondary applications of HLA typing results are limited because the results are typically stored as free text or PDF files in electronic medical records. Here we propose a method to convert HLA genotype information stored in an unstructured format into a reusable structured format by extracting serotype/allele information.
We queried HLA typing reports from the clinical data warehouse of Seoul National University Hospital (SUPPREME): reports from 2000 to 2018 served as the rule-development data set (64,024 reports) and reports from the most recent year served as the test set (6,181 reports). We applied a rule-based natural language processing approach, implemented with Python regular expressions, to extract 1) the number of patients in the report, 2) clinical characteristics such as the indication for HLA testing, and 3) precise HLA genotypes. The performance of the rules and code was evaluated by comparing the results extracted from the test set with a validation set generated by manual curation.
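The kind of regex-based extraction described above can be sketched as follows. This is a minimal illustration only, not the study's rule set: the pattern, the gene list, the function name `extract_hla_alleles`, and the sample report text are all assumptions made for the example, and real reports would need many more rules (the study reports 124).

```python
import re

# Hypothetical pattern (an assumption, not one of the study's rules):
# optional "HLA-" prefix, a gene name, "*", then 1 to 4 colon-separated
# numeric fields, e.g. "HLA-A*02:01" or "DRB1*15:01".
# The gene list is illustrative; the abstract does not name the five genes.
HLA_PATTERN = re.compile(
    r"\b(?:HLA-)?(A|B|C|DRB1|DQB1)\s*\*\s*(\d{2,3}(?::\d{2,3}){0,3})"
)

def extract_hla_alleles(report_text):
    """Return a dict mapping each gene to the alleles found in the text."""
    alleles = {}
    for gene, fields in HLA_PATTERN.findall(report_text):
        alleles.setdefault(gene, []).append(f"{gene}*{fields}")
    return alleles

# Made-up free-text report used only to exercise the pattern.
sample_report = (
    "HLA-A*02:01, A*24:02 / HLA-B*15:01, B*51:01 / DRB1*04:05, DRB1*15:01"
)
result = extract_hla_alleles(sample_report)
print(result)
```

Storing the output per gene gives a structured record that can be compared directly with manually curated genotypes.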
Among the 11,287 reports in the development set and 1,107 in the test set that described HLA typing for a single patient, iterative rule generation produced 124 extraction rules and 8 cleaning rules for HLA genotypes. Application of these rules extracted HLA genotypes with 0.892-0.999 precision and 0.795-0.998 recall for the five HLA genes. The precision and recall of the extraction rules for the number of patients in a report were 0.997 and 0.994, and those for clinical variable extraction were 0.997 and 0.992, respectively. All extracted HLA alleles and serotypes were transformed according to formal HLA nomenclature by the cleaning rules.
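For readers unfamiliar with the metrics, precision and recall against a manually curated reference could be computed as in the sketch below. The function name `precision_recall`, the (report ID, allele) pair representation, and the toy data are assumptions for illustration; the abstract does not describe how the study's evaluation code was organized.

```python
def precision_recall(extracted, curated):
    """Compare extracted items with a curated reference set.

    precision = true positives / all extracted items
    recall    = true positives / all curated items
    """
    extracted, curated = set(extracted), set(curated)
    true_positives = len(extracted & curated)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall

# Toy data (made up): each item is a (report ID, allele) pair.
extracted_pairs = {
    ("r1", "A*02:01"), ("r1", "A*24:02"),
    ("r2", "B*15:01"), ("r2", "B*51:02"),  # last pair is a wrong extraction
}
curated_pairs = {
    ("r1", "A*02:01"), ("r1", "A*24:02"),
    ("r2", "B*15:01"), ("r2", "B*51:01"),
    ("r3", "A*11:01"),                     # missed by the extraction
}
precision, recall = precision_recall(extracted_pairs, curated_pairs)
print(precision, recall)  # 0.75 0.6
```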
The rule-based HLA genotype extraction method shows reliable accuracy. We believe that a significant number of patients would benefit if this underused genetic information were returned to them.