Napolitano Giulio, Marshall Adele, Hamilton Peter, Gavin Anna T
Institut für Medizinische Biometrie, Informatik und Epidemiologie (IMBIE), Universität Bonn, Haus 325/11/1.OG/Raum 620, Sigmund-Freud-Straße 25, 53105 Bonn, Germany.
Queen's University Belfast, School of Mathematics and Physics, University Road, Belfast BT7 1NN, United Kingdom.
Artif Intell Med. 2016 Jun;70:77-83. doi: 10.1016/j.artmed.2016.06.001. Epub 2016 Jun 8.
Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.
The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout as either 'semi-structured' or 'unstructured'. The second technique was developed using the open source language engineering framework GATE and aimed to predict which chunks of the report text contain information on the cancer morphology, the tumour size, the hormone receptor status and the number of positive nodes. The classifiers were trained on 635 and tested on 163 manually classified or annotated reports from the Northern Ireland Cancer Registry.
The best result, 99.4% accuracy with only one semi-structured report predicted as unstructured, was produced by the layout classifier with the k-nearest neighbours algorithm, using the binary term occurrence word vector type with a stopword filter and pruning. For chunk recognition, the best results were obtained with the PAUM algorithm, using the same parameters in all cases except the prediction of chunks containing cancer morphology. For semi-structured reports, precision ranged from 0.97 to 0.94 and recall from 0.92 to 0.83; for unstructured reports, precision ranged from 0.91 to 0.64 and recall from 0.68 to 0.41. Results were poor when the classifier was trained on semi-structured reports but tested on unstructured ones.
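As an illustration only, the layout classification step might be sketched as follows in plain Python rather than RapidMiner. The sketch represents each report as a binary term occurrence vector (here, the set of terms present), filters stopwords, prunes terms that occur in too few documents, and classifies by majority vote among the k nearest training reports. The stopword list, the pruning threshold `min_docs`, the Jaccard distance, the value of k and the toy report texts are all assumptions made for this example; none of them is taken from the study.

```python
from collections import Counter

# Hypothetical stopword list; the abstract does not say which list was used.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "was", "with", "no"}


def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords.

    Returning a set gives binary term occurrence: a term is present or absent.
    """
    words = "".join(c if c.isalpha() else " " for c in text.lower()).split()
    return {w for w in words if w not in STOPWORDS}


def build_vocabulary(token_sets, min_docs=2):
    """Pruning (assumed rule): keep terms found in at least min_docs documents."""
    doc_freq = Counter(term for tokens in token_sets for term in tokens)
    return {term for term, n in doc_freq.items() if n >= min_docs}


def jaccard_distance(a, b):
    """Distance between two binary vectors stored as sets (assumed metric)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


class LayoutKNN:
    """Toy k-nearest neighbours classifier over binary term occurrence sets."""

    def __init__(self, k=3, min_docs=2):
        self.k = k
        self.min_docs = min_docs

    def fit(self, texts, labels):
        token_sets = [tokenize(t) for t in texts]
        self.vocab = build_vocabulary(token_sets, self.min_docs)
        self.train = [(tokens & self.vocab, label)
                      for tokens, label in zip(token_sets, labels)]
        return self

    def predict(self, text):
        query = tokenize(text) & self.vocab
        nearest = sorted(self.train,
                         key=lambda item: jaccard_distance(query, item[0]))
        votes = Counter(label for _, label in nearest[:self.k])
        return votes.most_common(1)[0][0]


# Invented toy reports, not data from the Northern Ireland Cancer Registry.
TOY_TEXTS = [
    "Specimen: left breast. Tumour size: 22 mm. Grade: 2. Nodes positive: 1",
    "Specimen: right breast. Tumour size: 15 mm. Grade: 3. Nodes positive: 0",
    "Specimen: left breast. Tumour size: 30 mm. Grade: 1. Nodes positive: 4",
    "Sections show an invasive ductal carcinoma measuring about 22 mm, "
    "and one node contains metastatic deposits",
    "Sections show invasive lobular carcinoma measuring 15 mm; "
    "nodes examined contain no metastatic deposits",
    "The sections show a ductal carcinoma measuring 30 mm "
    "and four nodes contain metastatic carcinoma",
]
TOY_LABELS = ["semi-structured"] * 3 + ["unstructured"] * 3

model = LayoutKNN(k=3).fit(TOY_TEXTS, TOY_LABELS)
print(model.predict("Specimen: right breast. Tumour size: 18 mm. Grade: 2."))
print(model.predict("Sections show a carcinoma measuring 18 mm "
                    "with metastatic deposits in two nodes"))
```

In this toy setting the field labels of a semi-structured report ("Specimen", "Tumour size", "Grade") survive pruning and dominate the distance. This suggests one reason why a simple bag-of-words representation could separate the two layouts well, although the sketch makes no claim about how the reported 99.4% accuracy was achieved.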
These results show that predicting the layout of reports is both possible and beneficial, and that the accuracy with which the segments of a report containing particular information can be predicted is sensitive to both the report layout and the type of information sought.