• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

用于改进基于规则的信息抽取自然语言处理管道的规则可读性的编程技术,这些管道处理非结构化和半结构化的医学文本。

Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts.

机构信息

Peter L. Reichertz Institute for Medical Informatics, TU Braunschweig and Hannover Medical School, Hannover, Germany.

Hasso-Plattner-Institut Fur Digital Engineering gGmbH, Potsdam, Germany.

出版信息

Health Informatics J. 2023 Apr-Jun;29(2):14604582231164696. doi: 10.1177/14604582231164696.

DOI:10.1177/14604582231164696
PMID:37068028
Abstract

BACKGROUND

Extraction of medical terms and their corresponding values from semi-structured and unstructured texts of medical reports can be a time-consuming and error-prone process. Methods of natural language processing (NLP) can help define an extraction pipeline for accomplishing a structured format transformation strategy.

OBJECTIVES

In this paper, we build an NLP pipeline to extract values of the classification of malignant tumors (TNM) from unstructured and semi-structured pathology reports and import them further to a structured data source for a clinical study. Our research interest is not focused on standard performance metrics like precision, recall, and F-measure on the test and validation data. We discuss how with the help of software programming techniques the readability of rule-based (RB) information extraction (IE) pipelines can be improved, and therefore minimize the time to correct or update the rules, and efficiently import them to another programming language.

METHODS

The extract rules were manually programmed with training data of TNM classification and tested in two separate pipelines based on design specifications from domain experts and data curators. Firstly we implemented each rule directly in one line for each extraction item. Secondly, we reprogrammed them in a readable fashion through decomposition and intention-revealing names for the variable declaration. To measure the impact of both methods we measure the time for the fine-tuning and programming of the extractions through test data of semi-structured and unstructured texts.

RESULTS

We analyze the benefits of improving through readability of the writing of rules, through parallel programming with regular expressions (REGEX), and the Apache Uima Ruta language (AURL). The time for correcting the readable rules in AURL and REGEX was significantly reduced. Complicated rules in REGEX are decomposed and intention-revealing declarations were reprogrammed in AURL in 5 min.

CONCLUSION

We discuss the importance of factor readability and how can it be improved when programming RB text IE pipelines. Independent of the features of the programming language and the tools applied, a readable coding strategy can be proven beneficial for future maintenance and offer an interpretable solution for understanding the extraction and for transferring the rules to other domains and NLP pipelines.

摘要

背景

从医学报告的半结构化和非结构化文本中提取医学术语及其对应值可能是一个耗时且容易出错的过程。自然语言处理 (NLP) 方法可以帮助定义提取管道,以实现结构化格式转换策略。

目的

在本文中,我们构建了一个 NLP 管道,从非结构化和半结构化的病理报告中提取恶性肿瘤 (TNM) 的分类值,并将其进一步导入结构化数据源以进行临床研究。我们的研究兴趣不在于测试和验证数据上的标准性能指标,如精度、召回率和 F 度量。我们讨论了如何借助软件编程技术提高基于规则 (RB) 的信息提取 (IE) 管道的可读性,从而最小化纠正或更新规则的时间,并有效地将其导入另一种编程语言。

方法

提取规则是使用 TNM 分类的训练数据手动编程的,并根据来自领域专家和数据管理员的设计规范在两个单独的管道中进行测试。首先,我们为每个提取项直接在一行中编写每条规则。其次,我们通过分解和为变量声明赋予意图揭示的名称,以可读的方式重新编程它们。为了衡量这两种方法的影响,我们通过半结构化和非结构化文本的测试数据来衡量微调和编程提取的时间。

结果

我们分析了通过规则编写的可读性提高、使用正则表达式 (REGEX) 和 Apache Uima Ruta 语言 (AURL) 进行并行编程的好处。在 AURL 和 REGEX 中纠正可读规则的时间明显减少。在 REGEX 中分解复杂规则,并在 AURL 中重新编程意图揭示的声明,仅需 5 分钟。

结论

我们讨论了编程 RB 文本 IE 管道时可读性的重要性以及如何提高可读性。无论编程语言的特点和应用的工具如何,可读的编码策略都可以被证明对未来的维护有益,并为理解提取和将规则转移到其他领域和 NLP 管道提供可解释的解决方案。

相似文献

1
Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts.用于改进基于规则的信息抽取自然语言处理管道的规则可读性的编程技术,这些管道处理非结构化和半结构化的医学文本。
Health Informatics J. 2023 Apr-Jun;29(2):14604582231164696. doi: 10.1177/14604582231164696.
2
Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing.设计一个基于 openEHR 的管道,使用自然语言处理提取和标准化非结构化临床数据。
Methods Inf Med. 2020 Dec;59(S 02):e64-e78. doi: 10.1055/s-0040-1716403. Epub 2020 Oct 14.
3
[A customized method for information extraction from unstructured text data in the electronic medical records].[一种从电子病历非结构化文本数据中提取信息的定制方法]
Beijing Da Xue Xue Bao Yi Xue Ban. 2018 Apr 18;50(2):256-263.
4
Extracting Structured Genotype Information from Free-Text HLA Reports Using a Rule-Based Approach.基于规则的方法从 HLA 报告的自由文本中提取结构化基因型信息。
J Korean Med Sci. 2020 Mar 30;35(12):e78. doi: 10.3346/jkms.2020.35.e78.
5
Automated classification of cancer morphology from Italian pathology reports using Natural Language Processing techniques: A rule-based approach.基于自然语言处理技术的意大利病理报告中癌症形态的自动分类:一种基于规则的方法。
J Biomed Inform. 2021 Apr;116:103712. doi: 10.1016/j.jbi.2021.103712. Epub 2021 Feb 18.
6
Natural language processing of radiology reports for identification of skeletal site-specific fractures.放射科报告的自然语言处理以识别骨骼部位特异性骨折。
BMC Med Inform Decis Mak. 2019 Apr 4;19(Suppl 3):73. doi: 10.1186/s12911-019-0780-5.
7
PDF text classification to leverage information extraction from publication reports.利用出版物报告中的信息提取进行PDF文本分类。
J Biomed Inform. 2016 Jun;61:141-8. doi: 10.1016/j.jbi.2016.03.026. Epub 2016 Apr 1.
8
Using natural language processing to extract structured epilepsy data from unstructured clinic letters: development and validation of the ExECT (extraction of epilepsy clinical text) system.利用自然语言处理从非结构化临床信件中提取结构化癫痫数据:ExECT(癫痫临床文本提取)系统的开发和验证。
BMJ Open. 2019 Apr 1;9(4):e023232. doi: 10.1136/bmjopen-2018-023232.
9
Facilitating clinical research through automation: Combining optical character recognition with natural language processing.通过自动化促进临床研究:结合光学字符识别和自然语言处理。
Clin Trials. 2022 Oct;19(5):504-511. doi: 10.1177/17407745221093621. Epub 2022 May 24.
10
Natural language processing with machine learning methods to analyze unstructured patient-reported outcomes derived from electronic health records: A systematic review.使用机器学习方法进行自然语言处理,以分析来自电子健康记录的非结构化患者报告结局:系统评价。
Artif Intell Med. 2023 Dec;146:102701. doi: 10.1016/j.artmed.2023.102701. Epub 2023 Nov 1.