Frid Santiago, Bracons Cucó Guillem, Gil Rojas Jessyca, López-Rueda Antonio, Pastor Duran Xavier, Martínez-Sáez Olga, Lozano-Rubí Raimundo
Clinical Informatics Service, Hospital Clínic de Barcelona, Villarroel 170, 08036 Barcelona, Spain. Electronic address: https://twitter.com/santifrik.
Fundació de Recerca Clínic Barcelona - Institut d'Investigacions Biomèdiques August Pi i Sunyer, Rosselló 149-153, 08036 Barcelona, Spain.
J Biomed Inform. 2023 Nov;147:104505. doi: 10.1016/j.jbi.2023.104505. Epub 2023 Sep 27.
Observational research in cancer poses major challenges for data sharing and consolidation, which depend on a homogeneous semantic base. Common Data Models (CDMs) can help consolidate health data repositories from different institutions, minimizing loss of meaning by organizing data into a standard structure. This study aims to evaluate the performance of the Observational Medical Outcomes Partnership (OMOP) CDM, Informatics for Integrating Biology & the Bedside (i2b2), and the International Cancer Genome Consortium, Accelerating Research in Genomic Oncology (ICGC ARGO) model for representing non-imaging data in a breast cancer use case of EuCanImage.
We used ontologies to represent the metamodels of OMOP, i2b2, and ICGC ARGO, as well as the variables used in a cancer use case of a European AI project. We selected four evaluation criteria for the CDMs, adapted from previous research: content coverage, simplicity, integration, and implementability.
i2b2 and OMOP exhibited higher element completeness (100% each) than ICGC ARGO (58.1%), while all three achieved 100% domain completeness. ICGC ARGO normalizes only one of our variables with a standard terminology, whereas i2b2 and OMOP use standardized vocabularies for all of them. In terms of simplicity, ICGC ARGO and i2b2 had the simpler ontological models (276 and 175 elements, respectively), while OMOP required a much more complex one (615 elements); queries took 7 lines of code in ICGC ARGO and 20 lines in both i2b2 and OMOP. Regarding implementability, OMOP had the highest number of mentions in articles in PubMed (130) and Google Scholar (1,810), ICGC ARGO had the highest number of updates to the CDM since 2020 (4), and i2b2 had the most tools specifically developed for the CDM (26).
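As an illustration of the element completeness figures above, the following minimal sketch assumes that element completeness is the percentage of use-case variables that can be represented by some element of a CDM; the abstract does not spell out the exact definition. The variable names, target names, and the `element_completeness` helper are hypothetical and are not taken from the study.

```python
from typing import Dict, Optional


def element_completeness(mapping: Dict[str, Optional[str]]) -> float:
    """Return the percentage of variables that map to some CDM element.

    `mapping` goes from a use-case variable name to the CDM element that
    represents it, or None when the CDM has no suitable element.
    (Assumed definition; the source does not state the formula.)
    """
    if not mapping:
        raise ValueError("mapping must contain at least one variable")
    covered = sum(1 for target in mapping.values() if target is not None)
    return 100.0 * covered / len(mapping)


# Hypothetical toy mapping, for illustration only (not data from the study).
toy_mapping = {
    "tumor_grade": "measurement",
    "histological_type": "condition_occurrence",
    "menopausal_status": None,  # no matching element in this toy CDM
    "treatment_type": "procedure_occurrence",
}

print(f"{element_completeness(toy_mapping):.1f}%")
```

In this toy case three of the four variables have a target, so the helper reports 75.0%. Domain completeness could be computed the same way after grouping variables by domain, again under the assumed definition.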
ICGC ARGO proved to be rigid and very limited in its representation of oncologic concepts, while i2b2 and OMOP performed very well. i2b2's lack of a common dictionary hinders its scalability, because sites that intend to share data must explicitly define a shared conceptual framework; this suggests that OMOP and its Oncology extension could be the more suitable choice. Future research applying these CDMs to actual datasets is needed.