Huser Vojtech, Amos Liz
National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
AMIA Annu Symp Proc. 2018 Dec 5;2018:602-608. eCollection 2018.
Common Data Elements (CDEs) are defined as "data elements that are common to multiple data sets across different studies". They provide structured, standardized definitions so that data can be collected and used consistently across datasets. CDE collections are traditionally developed prospectively by subject-matter and domain experts. However, little systematic research has examined how CDEs are actually used in real-world datasets, or how that use affects data discoverability. Our study builds upon previous mapping work to investigate how many CDEs can be identified at varying commonness thresholds in a real-world data repository, the database of Genotypes and Phenotypes (dbGaP). In the analyzed collection of mapped variables from 426 dbGaP studies, only 1,414 of the 24,938 defined PhenX variables (PHENotypes and eXposures; a CDE initiative) were observed. We report the CDEs identified at each commonness threshold. After semantic grouping of the 68 PhenX variables collected in at least 15 studies (commonness threshold n=15), we observed 32 truly "common" common data elements. We discuss the benefits of post-hoc mapping of study data to a CDE framework for findability and reuse, as well as the informatics challenges of pre-populating clinical research case report forms with data from Electronic Health Records, which are typically coded in terminologies aimed at routine healthcare needs.
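The commonness-threshold idea described in the abstract can be illustrated with a minimal sketch. It assumes the mapped data can be reduced to (study, variable) pairs; the function names, the toy study IDs, and the variable IDs below are hypothetical and are not taken from the paper or from dbGaP. The sketch counts how many variables appear in at least n distinct studies, and also works out the observed share of PhenX variables from the two figures quoted above.

```python
from collections import defaultdict


def studies_per_variable(mappings):
    """Map each variable ID to the set of distinct studies that collected it."""
    seen = defaultdict(set)
    for study_id, variable_id in mappings:
        seen[variable_id].add(study_id)
    return seen


def common_variables(mappings, threshold):
    """Return the variables observed in at least `threshold` distinct studies."""
    seen = studies_per_variable(mappings)
    return sorted(v for v, studies in seen.items() if len(studies) >= threshold)


# Hypothetical toy data: (study ID, mapped variable ID) pairs.
toy_mappings = [
    ("study_A", "PX_height"),
    ("study_B", "PX_height"),
    ("study_C", "PX_height"),
    ("study_A", "PX_weight"),
    ("study_B", "PX_weight"),
    ("study_C", "PX_smoking"),
    ("study_C", "PX_smoking"),  # a repeat within one study counts once
]

# Number of "common" variables at each commonness threshold n.
counts_by_threshold = {
    n: len(common_variables(toy_mappings, n)) for n in (1, 2, 3)
}

# Share of defined PhenX variables observed, using the abstract's figures.
observed_share = 1414 / 24938

if __name__ == "__main__":
    for n, count in counts_by_threshold.items():
        print(f"threshold n={n}: {count} variable(s)")
    print(f"observed share of PhenX variables: {observed_share:.1%}")
```

Raising the threshold can only shrink the set of qualifying variables, which is why the count of "common" elements falls as n grows. The semantic grouping step reported in the abstract (68 variables reduced to 32 elements) is a separate manual or curated step and is not modeled here.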