Peter L. Reichertz Institute for Medical Informatics, TU Braunschweig and Hannover Medical School, Hannover, Germany.
Hasso-Plattner-Institut für Digital Engineering gGmbH, Potsdam, Germany.
Health Informatics J. 2023 Apr-Jun;29(2):14604582231164696. doi: 10.1177/14604582231164696.
Extracting medical terms and their corresponding values from the semi-structured and unstructured texts of medical reports can be a time-consuming and error-prone process. Natural language processing (NLP) methods can help define an extraction pipeline that transforms such texts into a structured format.
In this paper, we build an NLP pipeline to extract values of the classification of malignant tumors (TNM) from unstructured and semi-structured pathology reports and import them into a structured data source for a clinical study. Our research interest is not focused on standard performance metrics such as precision, recall, and F-measure on the test and validation data. Instead, we discuss how software programming techniques can improve the readability of rule-based (RB) information extraction (IE) pipelines, thereby reducing the time needed to correct or update the rules and making it easier to transfer them to another programming language.
The extraction rules were manually programmed with training data for the TNM classification and tested in two separate pipelines, based on design specifications from domain experts and data curators. First, we implemented each rule directly in one line for each extraction item. Second, we reprogrammed the rules in a readable fashion through decomposition and intention-revealing names for the variable declarations. To measure the impact of both methods, we measured the time needed to program and fine-tune the extractions on test data of semi-structured and unstructured texts.
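The two rule-writing styles described above can be illustrated with a minimal Python sketch. It is not the authors' rule set: the pattern covers only a simplified subset of TNM notation, and all names (`ONE_LINE_RULE`, `STAGING_PREFIX`, `TUMOR_VALUE`, `extract_tnm`, and so on) are hypothetical and chosen only for illustration.

```python
import re

# Style 1: the whole rule written directly in one line.
# Compact, but every sub-part must be decoded by the reader before it can be corrected.
ONE_LINE_RULE = re.compile(
    r"\b([cpy]?)T(X|is|[0-4][a-d]?)\s*[cpy]?N(X|[0-3][a-c]?)\s*[cpy]?M(X|[01][a-c]?)"
)

# Style 2: the same rule decomposed into parts with intention-revealing names.
# Assumption: simplified value ranges, not the complete TNM specification.
STAGING_PREFIX = r"[cpy]?"            # clinical (c), pathological (p), post-therapy (y)
TUMOR_VALUE = r"X|is|[0-4][a-d]?"     # primary tumor category
NODE_VALUE = r"X|[0-3][a-c]?"         # regional lymph node category
METASTASIS_VALUE = r"X|[01][a-c]?"    # distant metastasis category
OPTIONAL_SPACE = r"\s*"

READABLE_RULE = re.compile(
    rf"\b(?P<prefix>{STAGING_PREFIX})T(?P<tumor>{TUMOR_VALUE})"
    rf"{OPTIONAL_SPACE}{STAGING_PREFIX}N(?P<node>{NODE_VALUE})"
    rf"{OPTIONAL_SPACE}{STAGING_PREFIX}M(?P<metastasis>{METASTASIS_VALUE})"
)


def extract_tnm(report_text):
    """Return the first TNM match as a dict of named parts, or None if absent."""
    match = READABLE_RULE.search(report_text)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    print(extract_tnm("Diagnosis: pT2a N1 M0, G2."))
```

Both patterns match the same strings. In the decomposed version, a correction such as widening the node category is made in a single named declaration, and each declaration maps naturally onto a separate declaration in another rule language.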
We analyze the benefits of improving the readability of the rules by programming them in parallel with regular expressions (REGEX) and the Apache UIMA Ruta language (AURL). The time for correcting the readable rules in AURL and REGEX was significantly reduced. Complicated rules in REGEX were decomposed, and the resulting intention-revealing declarations were reprogrammed in AURL in 5 min.
We discuss the importance of readability as a factor and how it can be improved when programming RB text IE pipelines. Independent of the features of the programming language and the tools applied, a readable coding strategy can prove beneficial for future maintenance and offers an interpretable solution for understanding the extraction and for transferring the rules to other domains and NLP pipelines.