IEEE J Biomed Health Inform. 2020 Jul;24(7):1952-1967. doi: 10.1109/JBHI.2020.2990797. Epub 2020 May 4.
Cancer registries collect unstructured and structured cancer data for surveillance purposes; these data provide important insights into cancer characteristics, treatments, and outcomes. Cancer registry data typically (1) categorize each reportable cancer case or tumor at the time of diagnosis, (2) contain demographic information about the patient, such as age, gender, and location at the time of diagnosis, (3) include planned and completed primary treatment information, and (4) may contain survival outcomes. As structured data are extracted from various unstructured sources, such as pathology reports, radiology reports, and medical records, and stored for reporting and other needs, the information representing a reportable cancer is constantly expanding and evolving. While popular analytic tools such as SEER*Stat and SAS exist, we provide a knowledge graph approach to organizing cancer registry data. Our approach offers unique advantages for timely data analysis and for the presentation and visualization of valuable information. The knowledge graph semantically enriches the data and readily enables linking with third-party data, which can help explain variation in cancer incidence patterns, disparities, and outcomes. We developed a prototype knowledge graph based on the Louisiana Tumor Registry dataset. We present the advantages of the knowledge graph approach by examining: i) scenario-specific queries, ii) links with openly available external datasets, iii) schema evolution for iterative analysis, and iv) data visualization. Our results demonstrate that this graph-based solution can perform complex queries, improve query run-time performance by up to 76%, and make iterative analyses easier to conduct, enhancing researchers' understanding of cancer registry data.
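To make the idea concrete, the following is a minimal sketch of how registry-style records might be held as subject-predicate-object triples, linked to an external dataset, and queried. It is not the prototype described above: the `TripleStore` class, the predicate names (`hasPrimarySite`, `diagnosedInCounty`, `hasSmokingRate`, `hasSurvivalMonths`), and all case and county values are hypothetical and synthetic, chosen only for illustration.

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Hypothetical in-memory triple store; a stand-in for a real graph database."""

    def __init__(self) -> None:
        self.triples: set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self.triples.add((s, p, o))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield every triple that agrees with the given fields; None is a wildcard."""
        for t in self.triples:
            if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
                yield t


kg = TripleStore()

# Synthetic registry-style cases (not real registry data).
kg.add("case:1", "hasPrimarySite", "site:Lung")
kg.add("case:1", "diagnosedInCounty", "county:A")
kg.add("case:1", "hasTreatment", "treatment:Surgery")
kg.add("case:2", "hasPrimarySite", "site:Lung")
kg.add("case:2", "diagnosedInCounty", "county:B")
kg.add("case:3", "hasPrimarySite", "site:Breast")
kg.add("case:3", "diagnosedInCounty", "county:A")

# Linking a made-up external dataset: county-level attributes attach to the
# same county nodes the cases already point to.
kg.add("county:A", "hasSmokingRate", "high")
kg.add("county:B", "hasSmokingRate", "low")


def lung_cases_in_high_smoking_counties(store: TripleStore) -> list[str]:
    """Scenario-style query that traverses case -> county -> external attribute."""
    result = []
    for case, _, _ in store.match(p="hasPrimarySite", o="site:Lung"):
        for _, _, county in store.match(s=case, p="diagnosedInCounty"):
            if any(store.match(s=county, p="hasSmokingRate", o="high")):
                result.append(case)
    return sorted(result)


# Schema evolution: a new predicate is added without altering existing triples.
kg.add("case:2", "hasSurvivalMonths", "18")

print(lung_cases_in_high_smoking_counties(kg))  # prints ['case:1']
```

In this sketch, adding the external county attributes or the new survival predicate requires no change to the triples already stored, which is the property that makes linking and iterative schema changes comparatively cheap in a graph representation.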