Department of Computing and Information Systems, The University of Melbourne, Doug McDonell Building, Parkville, 3010 VIC, Australia.
Barwon Health, Geelong Hospital, 1/75 Bellerine Street, Geelong, 3220 VIC, Australia.
Artif Intell Med. 2014 Sep;62(1):11-21. doi: 10.1016/j.artmed.2014.06.002. Epub 2014 Jun 21.
We address the task of extracting information from free-text pathology reports, focusing on staging information encoded by the TNM (tumour-node-metastases) and ACPS (Australian clinico-pathological stage) systems. Staging information is critical for diagnosing the extent of cancer in a patient and for planning individualised treatment. Extracting such information into more structured form saves time, improves reporting, and underpins the potential for automated decision support.
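To make the target of extraction concrete, the following is a minimal sketch of pulling a TNM string out of a sentence of free text. It is a simple regular-expression illustration, not the text mining model described in this abstract; the pattern, the function name `extract_tnm`, and the sample sentence are all hypothetical and cover only a small subset of TNM notation.

```python
import re

# Hypothetical, simplified pattern (an assumption, not the authors' method):
# optional prefix (p, c or y), then T and N categories, then an optional M.
TNM_PATTERN = re.compile(
    r"\b[pcy]?T(?P<t>[0-4]|is|x)[a-d]?\s*"
    r"[pc]?N(?P<n>[0-3]|x)[a-c]?\s*"
    r"(?:[pc]?M(?P<m>[01]|x))?",
    re.IGNORECASE,
)


def extract_tnm(text):
    """Return the first T, N and M categories found in text, or None."""
    match = TNM_PATTERN.search(text)
    if match is None:
        return None
    # Missing categories (for example, no M stated) are reported as None.
    return {
        key: (value.upper() if value else None)
        for key, value in match.groupdict().items()
    }


# Illustrative sentence, not taken from any real report.
print(extract_tnm("Staging: pT3 N1 M0."))
```

A fixed pattern like this breaks down quickly when reporting styles vary between centres, which is the kind of variation the portability experiments address.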
We investigate the portability of a text mining model built from the records of one health centre by applying it directly to the same extraction task over records from a different health centre, whose reports have different narrative characteristics. Apart from a simple normalisation step on features associated with target labels, the models from one system are applied to the other unchanged.
The best F-scores for in-hospital experiments are 81%, 85%, and 94% for the T, N, and M staging categories, respectively, while the best cross-hospital F-scores reach 84%, 81%, and 91% for the same categories.
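For readers unfamiliar with the metric, the sketch below shows how an F-score is computed from true positive, false positive, and false negative counts. It assumes the balanced F1 variant (the harmonic mean of precision and recall), which this abstract does not state explicitly; the function name `f_score` and the counts are illustrative and are not taken from the study.

```python
def f_score(tp, fp, fn):
    """Balanced F-score (F1) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 8 correct labels, 2 spurious, 2 missed.
print(round(f_score(8, 2, 2), 2))
```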
Our results compare favourably with the best levels reported in the literature and, most relevant to our aim here, the cross-corpus results demonstrate the portability of the models we developed.