Jay Caroline, Harper Simon, Dunlop Ian, Smith Sam, Sufi Shoaib, Goble Carole, Buchan Iain
Information Management Group, School of Computer Science, University of Manchester, Manchester, United Kingdom.
J Med Internet Res. 2016 Jan 14;18(1):e13. doi: 10.2196/jmir.4912.
Data discovery, particularly the discovery of key variables and their interrelationships, is central to secondary data analysis and, in turn, to the evolving field of data science. Interface designers have presumed that their users are domain experts, and so they have provided complex interfaces to support these "experts." Such interfaces hark back to a time when searches needed to be accurate the first time because each search carried a high computational cost. Our work is part of a governmental research initiative between the medical and social research funding bodies to improve the use of social data in medical research.
Because data science is cross-disciplinary, no assumptions can be made about the domain expertise of a particular scientist, whose interests may intersect multiple domains. Here we consider the common requirement for scientists to seek archived data for secondary analysis. This requirement has more in common with the search needs of the "Google generation" than with those of their single-domain, single-tool forebears. Our study compares a Google-like interface with traditional ways of searching for noncomplex health data in a data archive.
Two user interfaces were evaluated on the same set of tasks, which involved extracting data from surveys stored in the UK Data Archive (UKDA). One interface, Web search, is "Google-like," enabling users to browse, search for, and view metadata about study variables, whereas the other, traditional search, has a standard multioption user interface.
Using a comprehensive set of tasks with 20 volunteers, we found that the Web search interface met data discovery needs and expectations better than the traditional search. A task × interface repeated measures analysis showed a main effect of interface, indicating that answers found through the Web search interface were more likely to be correct (F(1,19)=37.3, P<.001), and a main effect of task (F(3,57)=6.3, P<.001). Further, participants completed the tasks significantly faster using the Web search interface (F(1,19)=18.0, P<.001). There was also a main effect of task on completion time (F(2,38)=4.1, P=.025, Greenhouse-Geisser correction applied). Participants were also asked to rate learnability, ease of use, and satisfaction. Paired mean comparisons showed that the Web search interface received significantly higher ratings than the traditional search interface for learnability (P=.002, 95% CI [0.6-2.4]), ease of use (P<.001, 95% CI [1.2-3.2]), and satisfaction (P<.001, 95% CI [1.8-3.5]). The results show superior cross-domain usability of Web search, which is consistent with its general familiarity and with its support for refining queries as the search proceeds, treating serendipity as part of the refinement.
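As an illustration of how a paired mean comparison yields a confidence interval of the kind reported above, the following is a minimal sketch. It assumes a t-based interval on the per-participant rating differences, which the abstract does not confirm. The function name `paired_mean_ci` and the sample ratings are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_mean_ci(a, b, t_crit):
    """Return (mean difference, CI lower, CI upper) for paired samples a and b.

    t_crit is the two-sided critical t value for len(a) - 1 degrees of
    freedom; the caller supplies it (2.093 for df = 19 at the 95% level).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Each participant contributes one difference: rating under a minus rating under b.
    diffs = [x - y for x, y in zip(a, b)]
    m = mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    se = stdev(diffs) / sqrt(len(diffs))
    half_width = t_crit * se
    return m, m - half_width, m + half_width


if __name__ == "__main__":
    # Made-up ratings for illustration only; these are NOT the study's data.
    web = [5, 6, 7, 8]
    traditional = [4, 4, 5, 5]
    # 3.182 is the two-sided 95% critical t value for df = 3.
    print(paired_mean_ci(web, traditional, t_crit=3.182))
```

With 20 participants, as in the study, the degrees of freedom would be 19 and the corresponding critical value 2.093. An interval that excludes zero corresponds to a significant paired difference at the 5% level.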
The results provide clear evidence that data science should adopt single-field natural language search interfaces for variable search, supporting in particular: query reformulation; data browsing; faceted search; surrogates; relevance feedback; and summarization, analytics, and visual presentation.