Rios Anthony, Durbin Eric B, Hands Isaac, Kavuluru Ramakanth
Dept. of Information Systems & Cyber Security, Cyber Center for Security & Analytics, University of Texas at San Antonio, San Antonio, Texas, USA.
Division of Biomedical Informatics (Internal Medicine), Kentucky Cancer Registry, University of Kentucky, Lexington, Kentucky, USA.
ACM BCB. 2021 Aug;2021. doi: 10.1145/3459930.3469541.
Tracking population-level cancer information is essential for researchers, clinicians, policymakers, and the public. Unfortunately, much of this information is stored as unstructured data in pathology reports. Thus, to process the information, we require either automated extraction techniques or manual curation. Moreover, many cancer-related concepts appear infrequently in real-world training datasets, and this limited data makes automated extraction difficult. This study introduces a technique that incorporates structured expert knowledge to improve histology and topography code classification models. Using pathology reports collected from the Kentucky Cancer Registry, we introduce a novel multi-task training approach with hierarchical regularization that incorporates structured information about the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3) classes to improve predictive performance. Overall, we find that our method improves both micro and macro F1. For macro F1, we achieve up to a 6% absolute improvement for topography codes and up to a 4% absolute improvement for histology codes.
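The abstract reports gains in both micro and macro F1, and the distinction matters when many codes are rare. The following is a minimal sketch, not the authors' evaluation code: the function name `f1_scores` and the toy labels are assumptions made for illustration (C50 and C34 are ICD-O-3 topography codes for breast and lung, but the counts are invented). It shows how a classifier that ignores a rare class can still score well on micro F1 while macro F1 drops sharply.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro F1, macro F1, per-class F1) for single-label predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Macro F1: unweighted mean over classes, so rare classes count equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro F1: pool all counts first, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro, per_class


# Toy data (hypothetical): 8 reports with a frequent code, 2 with a rare one,
# and a classifier that always predicts the frequent code.
gold = ["C50"] * 8 + ["C34"] * 2
pred = ["C50"] * 10

micro, macro, per_class = f1_scores(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Because macro F1 averages per-class scores without weighting by frequency, improvements on infrequent codes show up there more clearly than in micro F1.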