Cancer Registry, Institute for Digital Health Data Rhineland-Palatinate, Große Bleiche 46, Mainz, 55116, Germany; Information Systems and Business Administration, Johannes Gutenberg University, Jakob-Welder-Weg 9, Mainz, 55128, Germany.
Information Systems and Business Administration, Johannes Gutenberg University, Jakob-Welder-Weg 9, Mainz, 55128, Germany.
Int J Med Inform. 2024 May;185:105387. doi: 10.1016/j.ijmedinf.2024.105387. Epub 2024 Feb 28.
Cancer registries link a large number of electronic health records reported by medical institutions to already registered records of the matching individual and tumor. Records are linked automatically using deterministic and probabilistic approaches; machine learning is rarely used. Records that cannot be matched automatically with sufficient accuracy are typically processed manually. In practice, it is important to know how well record linkage approaches match real-world records and how much manual effort is required to achieve the desired linkage quality. We study the task of linking reported records to the matching registered tumor in cancer registries.
We compare the tradeoff between linkage quality and manual effort for five machine learning methods (logistic regression, random forest, gradient boosting, neural network, and a stacked method) against a deterministic baseline. The record linkage methods are compared in a two-class setting (no-match/match) and a three-class setting (no-match/undecided/match). The dataset, collected and linked by a cancer registry, consists of categorical variables and matches 145,755 reported records with 33,289 registered tumors.
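The three-class setting can be illustrated with a minimal sketch. It assumes that a model outputs a match probability per record pair and that two cut-offs route uncertain pairs to manual processing; the abstract does not state how the undecided class is derived, so the function names, the thresholds (0.2 and 0.8), and the probabilities below are hypothetical.

```python
from typing import List


def classify(prob: float, lower: float, upper: float) -> str:
    """Map a predicted match probability to one of three classes.

    Hypothetical rule: pairs at or above `upper` are linked automatically,
    pairs at or below `lower` are rejected automatically, and everything
    in between is left for a medical coder.
    """
    if prob >= upper:
        return "match"
    if prob <= lower:
        return "no-match"
    return "undecided"


def manual_share(probs: List[float], lower: float, upper: float) -> float:
    """Fraction of record pairs that would need manual processing."""
    labels = [classify(p, lower, upper) for p in probs]
    return labels.count("undecided") / len(labels)


# Illustrative probabilities, not data from the study.
probs = [0.02, 0.10, 0.48, 0.55, 0.91, 0.99, 0.97, 0.03]
labels = [classify(p, 0.2, 0.8) for p in probs]
share = manual_share(probs, 0.2, 0.8)

# Setting lower == upper removes the undecided band (two-class setting).
two_class_share = manual_share(probs, 0.5, 0.5)
```

Widening the band between the two thresholds sends more pairs to manual processing, which is the effort side of the tradeoff described above.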
In the two-class setting, the gradient boosting, neural network, and stacked models have higher accuracy and F score (accuracy: 0.968-0.978, F score: 0.983-0.988) than the deterministic baseline (accuracy: 0.964, F score: 0.980) when the same records are manually processed (0.89% of all records). In the three-class setting, these three machine learning methods can automatically process all reported records and still have higher accuracy and F score than the deterministic baseline. The linkage quality of the machine learning methods studied, except for the neural network, increases as the number of manually processed records increases.
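For reference, the two reported metrics can be computed from confusion-matrix counts as in the sketch below. It assumes that "F score" refers to the balanced F1 score, which the abstract does not state explicitly; the counts are hypothetical and are not taken from the study.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all decisions (match and no-match) that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (assumed F1).

    True negatives do not enter the formula.
    """
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts for 100 record pairs.
tp, tn, fp, fn = 80, 10, 6, 4
acc = accuracy(tp, tn, fp, fn)   # (80 + 10) / 100
f1 = f1_score(tp, fp, fn)        # 160 / 170
```

Because F1 ignores true negatives, it can differ from accuracy when matches and no-matches are unevenly represented.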
Machine learning methods can significantly improve linkage quality and reduce the manual effort required by medical coders to match tumor records in cancer registries compared to a deterministic baseline. Our results help cancer registries estimate how linkage quality increases as more records are manually processed.