Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts.

Author information

Peter L. Reichertz Institute for Medical Informatics, TU Braunschweig and Hannover Medical School, Hannover, Germany.

Hasso-Plattner-Institut für Digital Engineering gGmbH, Potsdam, Germany.

Publication information

Health Informatics J. 2023 Apr-Jun;29(2):14604582231164696. doi: 10.1177/14604582231164696.

DOI:10.1177/14604582231164696

PMID:37068028

Abstract

BACKGROUND

Extracting medical terms and their corresponding values from the semi-structured and unstructured texts of medical reports can be time-consuming and error-prone. Natural language processing (NLP) methods can help define an extraction pipeline that transforms such text into a structured format.

OBJECTIVES

In this paper, we build an NLP pipeline that extracts TNM (tumor, node, metastasis) classification values for malignant tumors from unstructured and semi-structured pathology reports and imports them into a structured data source for a clinical study. Our research interest is not in standard performance metrics such as precision, recall, and F-measure on the test and validation data. Instead, we discuss how software programming techniques can improve the readability of rule-based (RB) information extraction (IE) pipelines, thereby minimizing the time needed to correct or update the rules and to transfer them efficiently to another programming language.

METHODS

The extraction rules were programmed manually using TNM classification training data and tested in two separate pipelines based on design specifications from domain experts and data curators. First, we implemented each rule directly in a single line per extraction item. Second, we reprogrammed the rules in a readable fashion through decomposition and intention-revealing names in the variable declarations. To measure the impact of both methods, we measured the time needed to fine-tune and program the extractions on test data of semi-structured and unstructured texts.
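The contrast between the two rule styles can be sketched in Python as follows. This is only an illustration under stated assumptions: the patterns are simplified, hypothetical TNM expressions and are not the rules used in the study, and all names (ONE_LINE_TNM, STAGING_PREFIX, extract_tnm, and so on) are invented for this example.

```python
import re

# Style 1: the whole rule written directly in one line.
# Hypothetical, simplified pattern; not the rule set from the study.
ONE_LINE_TNM = re.compile(
    r"(?:[cpyra]{1,2})?T(is|[0-4][a-d]?|X)[\s,]*(?:[cpyra]{1,2})?N([0-3][a-c]?|X)[\s,]*(?:[cpyra]{1,2})?M([01][a-c]?|X)"
)

# Style 2: the same rule decomposed into parts with intention-revealing names.
STAGING_PREFIX = r"(?:[cpyra]{1,2})?"          # optional prefix such as "p" or "yp"
SEPARATOR = r"[\s,]*"                          # whitespace or commas between items
TUMOR_VALUE = r"(?P<tumor>is|[0-4][a-d]?|X)"   # e.g. "2", "1c", "is", "X"
NODE_VALUE = r"(?P<node>[0-3][a-c]?|X)"        # e.g. "0", "2a", "X"
METASTASIS_VALUE = r"(?P<metastasis>[01][a-c]?|X)"  # e.g. "0", "1a", "X"

READABLE_TNM = re.compile(
    STAGING_PREFIX + "T" + TUMOR_VALUE + SEPARATOR
    + STAGING_PREFIX + "N" + NODE_VALUE + SEPARATOR
    + STAGING_PREFIX + "M" + METASTASIS_VALUE
)


def extract_tnm(text):
    """Return the first TNM triple found in text as a dict, or None."""
    match = READABLE_TNM.search(text)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    print(extract_tnm("Diagnosis: pT2 pN1 M0, G2."))
```

Both patterns accept the same strings. In the decomposed version, a change such as allowing an additional separator touches one named constant instead of three places inside a long expression.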

RESULTS

We analyze the benefits of more readable rule writing by programming the rules in parallel with regular expressions (REGEX) and the Apache UIMA Ruta language (AURL). The time needed to correct the readable rules in AURL and REGEX was significantly reduced. Complicated REGEX rules were decomposed and reprogrammed in AURL with intention-revealing declarations in 5 min.

CONCLUSION

We discuss the importance of readability as a factor and how it can be improved when programming RB text IE pipelines. Independent of the features of the programming language and the tools applied, a readable coding strategy can prove beneficial for future maintenance and offers an interpretable solution for understanding the extraction and for transferring the rules to other domains and NLP pipelines.
