Shi Jingyi, Zheng Mingna, Yao Lixia, Ge Yaorong
Department of Software and Information Systems, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, 28223, NC, USA.
Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, 55905, MN, USA.
BMC Med Genomics. 2018 Nov 20;11(Suppl 5):102. doi: 10.1186/s12920-018-0411-5.
The right dataset is essential to obtaining the right insights in data science; data scientists therefore need a good understanding of which relevant datasets are available, as well as the content, structure, and existing analyses of those datasets. Although a number of efforts are underway to integrate the large number and variety of datasets, an information resource that focuses on the specific needs of the datasets' target users has been lacking for years. To address this gap, we have developed a Dataset Information Resource (DIR), using a user-oriented approach, which gathers relevant dataset knowledge for specific user types. In the present version, we specifically address the challenges that entry-level data scientists face in learning to identify, understand, and analyze major datasets in healthcare. We emphasize that the DIR does not contain actual data from the datasets but aims to provide comprehensive knowledge about the datasets and their analyses.
The DIR leverages Semantic Web technologies and the W3C Dataset Description Profile as the standard for knowledge integration and representation. To extract tailored knowledge for target users, we have developed methods for manual extraction from dataset documentation as well as semi-automatic extraction from related publications, using natural language processing (NLP)-based approaches. A semantic query component is available for knowledge retrieval, and a parameterized question-answering function is provided to simplify search.
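The idea of parameterized question answering over Semantic Web metadata can be sketched as follows. This is a minimal illustration, not the DIR's implementation: the triples, the `ex:` dataset identifiers, the `QUESTION_TEMPLATES` table, and the functions `build_query` and `titles_by_theme` are all hypothetical, and the in-memory lookup merely stands in for a real SPARQL endpoint. Only the `dct:title` and `dcat:theme` property names are borrowed from existing vocabularies (Dublin Core Terms and DCAT).

```python
from string import Template

# Hypothetical metadata as (subject, predicate, object) triples.
# Dataset identifiers and values are placeholders, not DIR content.
TRIPLES = [
    ("ex:datasetA", "dct:title", "Example Claims Dataset"),
    ("ex:datasetA", "dcat:theme", "claims"),
    ("ex:datasetB", "dct:title", "Example Survey Dataset"),
    ("ex:datasetB", "dcat:theme", "survey"),
]

# Each frequently asked question maps to a query template with named slots.
QUESTION_TEMPLATES = {
    "datasets_by_theme": Template(
        'SELECT ?title WHERE { ?d dcat:theme "$theme" ; dct:title ?title . }'
    ),
}


def build_query(question_id, **params):
    """Fill the slots of a question template to produce a SPARQL string.

    A real system would also escape or validate the parameter values.
    """
    return QUESTION_TEMPLATES[question_id].substitute(**params)


def titles_by_theme(triples, theme):
    """Answer the same question directly over the in-memory triples."""
    subjects = {s for s, p, o in triples if p == "dcat:theme" and o == theme}
    return sorted(o for s, p, o in triples if p == "dct:title" and s in subjects)


if __name__ == "__main__":
    print(build_query("datasets_by_theme", theme="claims"))
    print(titles_by_theme(TRIPLES, "claims"))
```

Under these assumptions, the user only chooses a question and supplies a value such as a theme; the query syntax stays hidden, which is the point of a parameterized interface for novices.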
The DIR prototype is composed of four major components: dataset metadata and related knowledge, search modules, question answering for frequently asked questions, and blogs. The current implementation includes information on 12 commonly used large and complex healthcare datasets. An initial usage evaluation with health informatics novices indicates that the DIR is helpful and beginner-friendly.
We have developed a novel user-oriented DIR that provides dataset knowledge specialized for target user groups. Knowledge about datasets is effectively represented in the Semantic Web. Even at this initial stage, the DIR provides sophisticated and relevant knowledge of 12 datasets to help entry-level health informaticians learn healthcare data analysis using suitable datasets. Further development of both content and functionality is underway.