
Extracting Structured Genotype Information from Free-Text HLA Reports Using a Rule-Based Approach.

Affiliations

Center for Precision Medicine, Seoul National University Hospital, Seoul, Korea.

Division of Biomedical Informatics, Seoul National University Biomedical Informatics and Systems Biomedical Informatics Research Center, Seoul National University College of Medicine, Seoul, Korea.

Publication information

J Korean Med Sci. 2020 Mar 30;35(12):e78. doi: 10.3346/jkms.2020.35.e78.

DOI:10.3346/jkms.2020.35.e78
PMID:32233158
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7105511/
Abstract

BACKGROUND

Human leukocyte antigen (HLA) typing is important for transplant patients to prevent a severe mismatch reaction, and the result can also support the diagnosis of various diseases or the prediction of drug side effects. However, such secondary applications of HLA typing results are limited because the results are typically provided as free text or PDFs in electronic medical records. Here we propose a method to convert HLA genotype information stored in an unstructured format into a reusable structured format by extracting serotype/allele information.

METHODS

We queried HLA typing reports from the clinical data warehouse of Seoul National University Hospital (SUPPREME) from 2000 to 2018 as a rule-development data set (64,024 reports) and from the most recent year (6,181 reports) as a test set. We used a rule-based natural language processing approach with Python regular expressions to extract 1) the number of patients in the report, 2) clinical characteristics such as the indication for HLA testing, and 3) precise HLA genotypes. The performance of the rules and code was evaluated by comparing the results extracted from the test set with a validation set generated by manual curation.
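The kind of regex-based extraction described above can be sketched as follows. This is a minimal illustration, not one of the paper's rules: the pattern, the gene list (A, B, C, DRB1, DQB1 are an assumption, since the abstract does not name the five genes), the function name `extract_alleles`, and the sample text are all hypothetical.

```python
import re

# Hypothetical pattern, NOT one of the paper's 124 extraction rules.
# It matches an optional "HLA-" prefix, a gene name (assumed gene list),
# a "*" separator, and one to four colon-separated numeric fields.
ALLELE_RE = re.compile(
    r"\b(?:HLA-)?(?P<gene>A|B|C|DRB1|DQB1)\*(?P<fields>\d{2,3}(?::\d{2,3}){0,3})"
)

def extract_alleles(report_text):
    """Return (gene, fields) pairs for allele-like mentions in free text."""
    return [
        (m.group("gene"), m.group("fields"))
        for m in ALLELE_RE.finditer(report_text)
    ]

# Invented sample text in the style of a typing report.
sample = "HLA typing: A*02:01, A*24:02; B*15:01, B*51:01; DRB1*04:05, DRB1*15:01"
for gene, fields in extract_alleles(sample):
    print(gene, fields)
```

A single pattern like this would miss most real-world spellings, which is presumably why the authors needed many rules developed iteratively against the development set.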

RESULTS

Among the 11,287 reports in the development set and the 1,107 reports in the test set that described HLA typing for a single patient, iterative rule generation produced 124 extraction rules and 8 cleaning rules for HLA genotypes. Applying these rules extracted HLA genotypes with 0.892-0.999 precision and 0.795-0.998 recall across the five HLA genes. The precision and recall of the extraction rules for the number of patients in a report were 0.997 and 0.994, and those for clinical variable extraction were 0.997 and 0.992, respectively. All extracted HLA alleles and serotypes were transformed according to formal HLA nomenclature by the cleaning rules.
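To make the cleaning and evaluation steps concrete, here is a small sketch under stated assumptions. The cleaning function is a hypothetical stand-in for one cleaning rule (it only rewrites loose two-field spellings such as `a 0201` into `A*02:01`), and the scoring treats each (report, allele) pair as one item, which may differ from how the authors counted. The data are invented and unrelated to the reported figures.

```python
import re

# Hypothetical cleaning pattern: optional "HLA-" prefix, gene name,
# optional "*", then two 2-digit fields with an optional colon between them.
CLEAN_RE = re.compile(r"(?:HLA-)?\s*([A-Z]+\d?)\s*\*?\s*(\d{2}):?(\d{2})")

def clean_allele(raw):
    """Rewrite a loosely written two-field allele as GENE*NN:NN, or return None."""
    m = CLEAN_RE.fullmatch(raw.strip().upper())
    if m is None:
        return None
    gene, field1, field2 = m.groups()
    return f"{gene}*{field1}:{field2}"

def precision_recall(extracted, curated):
    """Score two sets of (report_id, allele) pairs against each other."""
    true_pos = len(extracted & curated)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(curated) if curated else 0.0
    return precision, recall

# Invented toy data: raw extractions and a manually curated answer set.
raw_extracted = [
    ("r1", "a 0201"), ("r1", "A*24:02"),
    ("r2", "HLA-B*1501"), ("r2", "B 5101"),
    ("r3", "A*11:01"),  # spurious extraction, absent from the curated set
]
curated = {
    ("r1", "A*02:01"), ("r1", "A*24:02"),
    ("r2", "B*15:01"), ("r2", "B*51:01"),
    ("r2", "DRB1*04:05"),  # allele the toy rules failed to extract
}

extracted = {(rid, clean_allele(raw)) for rid, raw in raw_extracted}
precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Normalizing before scoring matters here: without the cleaning step, `a 0201` and `A*02:01` would count as a mismatch even though they denote the same allele.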

CONCLUSION

The rule-based HLA genotype extraction method shows reliable accuracy. We believe that a significant number of patients would benefit if this under-used genetic information were returned to them.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6b37/7105511/a3a41c543f0c/jkms-35-e78-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6b37/7105511/7ae67ed0a126/jkms-35-e78-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6b37/7105511/d0ea84d8c632/jkms-35-e78-g002.jpg
