Using clinical text to refine unspecific condition codes in Dutch general practitioner EHR data.

Affiliation

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.

Publication information

Int J Med Inform. 2024 Sep;189:105506. doi: 10.1016/j.ijmedinf.2024.105506. Epub 2024 May 29.

DOI:10.1016/j.ijmedinf.2024.105506

PMID:38820647

Abstract

OBJECTIVE

Observational studies using electronic health record (EHR) databases often face challenges due to unspecific clinical codes that can obscure detailed medical information, hindering precise data analysis. In this study, we aimed to assess the feasibility of refining these unspecific condition codes into more specific codes in a Dutch general practitioner (GP) EHR database by leveraging the available clinical free text.

METHODS

We used three approaches for text classification (search queries, semi-supervised learning, and supervised learning) to improve the specificity of ten unspecific International Classification of Primary Care (ICPC-1) codes. Two text representations and three machine learning algorithms were evaluated for the (semi-)supervised models. Additionally, we measured the improvement achieved by the refinement process on all code occurrences in the database.
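As a rough illustration of the simplest of the three approaches, the sketch below applies keyword search queries to free-text notes attached to one unspecific code and reports the share of occurrences that could be refined. It is not the study's implementation: the subcategory labels, the Dutch keywords, the example notes, and the function names `refine` and `refined_fraction` are all hypothetical and chosen only for demonstration.

```python
import re

# Hypothetical search queries for one unspecific fracture code.
# Labels and keywords are illustrative assumptions, not taken from the study.
QUERIES = {
    "fracture_hand": re.compile(r"\b(hand|vinger|pols)", re.IGNORECASE),
    "fracture_foot": re.compile(r"\b(voet|teen)", re.IGNORECASE),
}


def refine(text):
    """Return the one matching subcategory, or None if zero or several match."""
    hits = [label for label, pattern in QUERIES.items() if pattern.search(text)]
    return hits[0] if len(hits) == 1 else None


def refined_fraction(notes):
    """Fraction of notes that received a more specific label."""
    if not notes:
        return 0.0
    refined = sum(1 for note in notes if refine(note) is not None)
    return refined / len(notes)


# Made-up example notes, one per occurrence of the unspecific code.
notes = [
    "Fractuur vinger rechts na val",
    "pijn voet, rontgen: fractuur teen",
    "fractuur, verwezen naar SEH",
    "val op hand en voet",
]

labels = [refine(note) for note in notes]
share = refined_fraction(notes)
```

In this sketch a note that matches no query, or more than one, keeps its unspecific code. That conservative rule is a design assumption; a real query set would need clinical review and handling of negation and misspellings.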

RESULTS

The classification models performed well for most codes. In general, no single classification approach consistently outperformed the others. However, there were variations in the relative performance of the classification approaches within each code and in the use of different text representations and machine learning algorithms. Class imbalance and limited training data affected the performance of the (semi-)supervised models, yet the simple search queries remained particularly effective. Ultimately, the developed models improved the specificity of over half of all the unspecific code occurrences in the database.

CONCLUSIONS

Our findings show the feasibility of using information from clinical text to improve the specificity of unspecific condition codes in observational healthcare databases, even with a limited range of machine-learning techniques and modest annotated training sets. Future work could investigate transfer learning, integration of structured data, alternative semi-supervised methods, and validation of models across healthcare settings. The improved level of detail enriches the interpretation of medical information and can benefit observational research and patient care.
