Memorial Sloan Kettering Cancer Center, New York, NY.
Center for Translational Data Science, University of Chicago, Chicago, IL.
JCO Clin Cancer Inform. 2020 Aug;4:691-699. doi: 10.1200/CCI.20.00037.
As data-sharing projects become more frequent, the need to map data elements between multiple classification systems grows with them. A generic, robust, shareable architecture will increase the efficiency and transparency of the mapping process while upholding the integrity of the data.
The American Association for Cancer Research's Genomics Evidence Neoplasia Information Exchange (GENIE) collects clinical and genomic data for precision cancer medicine. As part of its commitment to open science, GENIE has partnered with the National Cancer Institute's Genomic Data Commons (GDC) as a secondary repository. After initial efforts to submit data from GENIE to GDC failed, we realized the need for a solution to allow for the iterative mapping of data elements between dynamic classification systems. We developed the Linked Entity Attribute Pair (LEAP) database framework to store and manage the term mappings used to submit data from GENIE to GDC.
After creating and populating the LEAP framework, we identified 195 mappings from GENIE to GDC requiring remediation and observed a 28% reduction in the effort needed to resolve these issues, as well as fewer inadvertent errors. As a result, the time to map between OncoTree, the cancer type ontology used by GENIE, and the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3), used by GDC, fell from several months to less than 1 week.
The LEAP framework provides a streamlined process for mapping among classification systems and is reusable, so creating or adjusting mappings is straightforward. Its ability to track changes over time simplifies mapping data elements across dynamic classification systems.
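The abstract does not describe LEAP's actual schema, so the following is only a minimal sketch of the general idea it names: a store of term mappings between classification systems in which changes are appended as new revisions rather than overwritten. The table name `term_mapping`, its columns, the helper functions, and the terms `SRC_A`, `TGT_1`, and `TGT_2` are all hypothetical placeholders, not real OncoTree or ICD-O-3 codes and not the authors' design.

```python
import sqlite3


def open_store():
    """Create an in-memory store with a hypothetical mapping table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE term_mapping (
            id            INTEGER PRIMARY KEY,
            source_system TEXT NOT NULL,
            source_term   TEXT NOT NULL,
            target_system TEXT NOT NULL,
            target_term   TEXT NOT NULL,
            revision      INTEGER NOT NULL,
            UNIQUE (source_system, source_term, target_system, revision)
        )
        """
    )
    return conn


def set_mapping(conn, source_system, source_term, target_system, target_term):
    """Append a new revision of a mapping so earlier values stay queryable."""
    key = (source_system, source_term, target_system)
    row = conn.execute(
        "SELECT COALESCE(MAX(revision), 0) FROM term_mapping "
        "WHERE source_system = ? AND source_term = ? AND target_system = ?",
        key,
    ).fetchone()
    revision = row[0] + 1
    conn.execute(
        "INSERT INTO term_mapping "
        "(source_system, source_term, target_system, target_term, revision) "
        "VALUES (?, ?, ?, ?, ?)",
        key + (target_term, revision),
    )
    conn.commit()
    return revision


def current_mapping(conn, source_system, source_term, target_system):
    """Return the most recent target term, or None if no mapping exists."""
    row = conn.execute(
        "SELECT target_term FROM term_mapping "
        "WHERE source_system = ? AND source_term = ? AND target_system = ? "
        "ORDER BY revision DESC LIMIT 1",
        (source_system, source_term, target_system),
    ).fetchone()
    return row[0] if row else None


def history(conn, source_system, source_term, target_system):
    """Return every (revision, target_term) pair in the order recorded."""
    rows = conn.execute(
        "SELECT revision, target_term FROM term_mapping "
        "WHERE source_system = ? AND source_term = ? AND target_system = ? "
        "ORDER BY revision",
        (source_system, source_term, target_system),
    )
    return [(revision, term) for revision, term in rows]


if __name__ == "__main__":
    store = open_store()
    # Placeholder terms only; these are not real ontology codes.
    set_mapping(store, "OncoTree", "SRC_A", "ICD-O-3", "TGT_1")
    set_mapping(store, "OncoTree", "SRC_A", "ICD-O-3", "TGT_2")
    print(current_mapping(store, "OncoTree", "SRC_A", "ICD-O-3"))
    print(history(store, "OncoTree", "SRC_A", "ICD-O-3"))
```

Appending revisions instead of updating rows in place is one simple way to keep an audit trail when either classification system changes; a production design would likely also record who made each change and which release of each system it applies to.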