Lister Hill National Center for Biomedical Communications, National Library of Medicine, NIH, Bethesda, MD, United States of America.
PLoS One. 2020 Oct 5;15(10):e0240047. doi: 10.1371/journal.pone.0240047. eCollection 2020.
Efforts to define research Common Data Elements aim to harmonize data collection across clinical studies.
Our goal was to analyze the quality and usability of data dictionaries of HIV studies.
For the clinical domain of HIV, we searched data sharing platforms and acquired a set of 18 HIV-related studies, from which we analyzed 26 328 data elements. We identified existing standards for creating a data dictionary and reviewed their use. To facilitate aggregation across studies, we defined three types of data dictionary (data elements, forms, and permissible values) and created a simple information model for each type.
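As an illustration of what a simple information model for the three dictionary types could look like, the following is a minimal Python sketch. The class names, field names, and the `count_elements_per_study` helper are hypothetical and chosen for illustration; they are not the model used in the study.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class DataElement:
    """One row of a data element dictionary (hypothetical fields)."""
    study_id: str
    name: str
    data_type: str                      # e.g. "string", "numeric", "date", "date-time"
    description: Optional[str] = None
    form_name: Optional[str] = None     # group or form the element belongs to


@dataclass
class Form:
    """One row of a forms dictionary: a named group of data elements."""
    study_id: str
    name: str
    element_names: List[str] = field(default_factory=list)


@dataclass
class PermissibleValue:
    """One row of a permissible values dictionary: an allowed code and its label."""
    study_id: str
    element_name: str
    code: str
    label: str


def count_elements_per_study(elements: Iterable[DataElement]) -> Dict[str, int]:
    """Aggregate across studies: number of data elements per study."""
    counts: Dict[str, int] = {}
    for element in elements:
        counts[element.study_id] = counts.get(element.study_id, 0) + 1
    return counts


# Made-up example records, not taken from any of the analyzed studies.
elements = [
    DataElement("study_a", "age", "numeric", "Age in years", "demographics"),
    DataElement("study_a", "sex", "string", "Sex at birth", "demographics"),
    DataElement("study_b", "visit_date", "date", "Date of visit", "visit"),
]
values = [
    PermissibleValue("study_a", "sex", "1", "Male"),
    PermissibleValue("study_a", "sex", "2", "Female"),
]
per_study = count_elements_per_study(elements)
```

Keeping the three record types separate but linked by study identifier and element name lets dictionaries from different studies be concatenated into three common tables, one per type.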
An average study had 427 data elements (ranging from 46 to 9 945 elements). In terms of data type, 48.6% of data elements were string, 47.8% were numeric, 3.0% were date, and 0.6% were date-time. No study in our sample explicitly declared a data element as a categorical variable; such elements were instead typed as either string or numeric. We were able to obtain permissible values for only 61% of studies. The majority of studies used CSV files to share a data dictionary, while 22% of studies used a non-computable PDF format. All studies grouped their data elements. The average number of groups or forms per study was 24 (ranging from 2 to 124 groups/forms). An accurate and well-formatted data dictionary facilitates error-free secondary analysis and can help with data de-identification.
We observed features of data dictionaries that made them difficult to use and understand. These included multiple data dictionary files, non-machine-readable documents, data elements present in the data but absent from the dictionary, and missing data types or descriptions. Building on our experience with aggregating data elements across a large set of studies, we created a set of recommendations (called the CONSIDER statement) that can guide optimal data sharing in future studies.
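Some of the problems listed above can be detected automatically when the dictionary is machine-readable. The following is a minimal Python sketch of such a check, assuming a CSV dictionary with hypothetical column names `name`, `data_type`, and `description`; the `audit_dictionary` function and the sample data are made up for illustration and are not part of the CONSIDER statement.

```python
import csv
import io
from typing import Dict, Iterable, List


def audit_dictionary(dictionary_rows: Iterable[Dict[str, str]],
                     data_columns: Iterable[str]) -> Dict[str, List[str]]:
    """Report elements missing from the dictionary, and dictionary rows
    that lack a data type or a description."""
    rows = list(dictionary_rows)
    documented = {row["name"] for row in rows}
    return {
        # Present in the data file but not described in the dictionary.
        "undocumented": sorted(set(data_columns) - documented),
        # Described in the dictionary but with an empty data type.
        "missing_type": sorted(
            row["name"] for row in rows
            if not (row.get("data_type") or "").strip()
        ),
        # Described in the dictionary but with an empty description.
        "missing_description": sorted(
            row["name"] for row in rows
            if not (row.get("description") or "").strip()
        ),
    }


# Made-up dictionary: column names are an assumption, not a standard.
DICTIONARY_CSV = """name,data_type,description
age,numeric,Age in years
sex,string,
visit_date,,Date of visit
"""

# Made-up header of the corresponding data file.
data_columns = ["age", "sex", "visit_date", "cd4_count"]

report = audit_dictionary(csv.DictReader(io.StringIO(DICTIONARY_CSV)), data_columns)
print(report)
```

Here `cd4_count` is flagged as undocumented, `visit_date` as lacking a data type, and `sex` as lacking a description. A check of this kind is not possible when the dictionary is shared only as a PDF.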