Rashid Sabbir M, McCusker James P, Pinheiro Paulo, Bax Marcello P, Santos Henrique, Stingone Jeanette A, Das Amar K, McGuinness Deborah L
Rensselaer Polytechnic Institute, Troy, NY, 12180, USA.
Universidade Federal de Minas Gerais, Belo Horizonte, MG, 31270-901, BR.
Data Intell. 2020 Fall;2(4):443-486. doi: 10.1162/dint_a_00058. Epub 2020 Oct 22.
When publishing datasets, data providers commonly include a text description of each column in the form of a data dictionary. While these documents help an end-user properly interpret the meaning of a column in a dataset, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation of data, enabling standardization and harmonization across diverse datasets. In this paper, we present the Semantic Data Dictionary in the context of our work with biomedical data; however, the approach can be, and has been, used in a wide range of domains. Rendering data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey dataset, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large NIH-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.