Public Health Department, Strasbourg University Hospital, 67000, Strasbourg, France.
Int J Med Inform. 2020 Jul;139:104139. doi: 10.1016/j.ijmedinf.2020.104139. Epub 2020 Apr 9.
Cancer registries are collections of curated data about malignant tumor diseases. The amount of data processed by cancer registries increases every year, making manual registration more and more tedious.
We sought to develop an automatic analysis pipeline that would be able to identify and preprocess registry input for incident prostate adenocarcinomas in a French regional cancer registry.
Notifications submitted to the Bas-Rhin cancer registry from different sources were used: pathology data, and ICD-10 diagnosis codes from hospital discharge data and healthcare insurance data. We trained a Support Vector Machine model (machine learning) to predict whether a patient's data should be considered an incident case of prostate adenocarcinoma, and should therefore be registered. The final registration of all identified cases was manually confirmed by a specialized technician. Text mining tools (regular expressions) were used to extract clinical and biological data from unstructured pathology reports.
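The regular-expression step can be illustrated with a minimal sketch. The abstract does not give the expressions actually used, so the patterns, function names, and the sample report sentence below are all hypothetical, assuming French-style reports that write the Gleason score as two grades joined by "+" and the PSA value in ng/ml with a decimal comma.

```python
import re
from typing import Optional

# Hypothetical patterns: the study's actual expressions are not published in the abstract.
# Gleason: the word "Gleason", up to 20 characters that are not "+", then "<grade>+<grade>".
GLEASON_RE = re.compile(r"gleason[^+\n]{0,20}?(\d)\s*\+\s*(\d)", re.IGNORECASE)
# PSA: the token "PSA", a few non-digit characters, then a number followed by "ng/ml".
PSA_RE = re.compile(r"\bPSA\b\D{0,15}?(\d+(?:[.,]\d+)?)\s*ng\s*/\s*ml", re.IGNORECASE)


def extract_gleason(text: str) -> Optional[int]:
    """Return the Gleason score (sum of the two grades), or None if not found."""
    match = GLEASON_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)) + int(match.group(2))


def extract_psa(text: str) -> Optional[float]:
    """Return the PSA value in ng/ml, or None if not found."""
    match = PSA_RE.search(text)
    if match is None:
        return None
    # Accept a decimal comma as well as a decimal point.
    return float(match.group(1).replace(",", "."))


# Invented example sentence, not taken from the registry data.
report = "Adénocarcinome prostatique. Score de Gleason 7 (3+4). PSA : 8,5 ng/ml."
gleason = extract_gleason(report)
psa = extract_psa(report)
```

Returning `None` when nothing matches keeps "not detected" distinct from a wrongly extracted value, which is the distinction the results draw between detection and accuracy.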
We performed two successive analyses. First, we used 982 cases manually labeled by registrars from the 2014 dataset to predict the registration of 785 cases submitted in 2015. Then, we repeated the procedure using the 2089 cases labeled by registrars from the 2014 and 2015 datasets to predict the registration of 926 cases submitted in 2016. The algorithm identified 663 cases of prostate adenocarcinoma in 2015, and 610 in 2016. From these findings, 663 and 531 cases were respectively added to the registry, and 641 and 512 cases were confirmed by the specialized technician. This registration process achieved a precision above 96 %. The algorithm obtained an overall precision of 99 % (99.5 % in 2015 and 98.5 % in 2016) and a recall of 97 % (97.8 % in 2015 and 96.9 % in 2016). When the information was present in the pathology report, text mining achieved more than 90 % accuracy for the major indicators (PSA test, Gleason score, and incidence date). For both PSA and tumor side, however, no information was detected in the majority of cases.
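For readers less familiar with the two metrics reported here, precision is the share of algorithm-flagged cases that are true incident cases, and recall is the share of true incident cases that the algorithm flagged. The sketch below shows the arithmetic. The abstract does not report the underlying confusion-matrix counts, so the counts used are purely illustrative and the function name is hypothetical.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute (precision, recall) from true positives, false positives, false negatives."""
    # Precision: of everything flagged as a case, how much was correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real cases, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only; these are NOT the study's data.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
```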
Machine learning was able to identify new cases of prostate cancer, and text mining was able to prefill the data about incident cases. Machine-learning-based automation of the registration process could reduce delays in data production and allow investigators to devote more time to complex tasks and analysis.