Baro Emilie, Degoul Samuel, Beuscart Régis, Chazard Emmanuel
Department of Public Health, EA 2694, University of Lille, 1 Place de Verdun, 59045 Lille Cedex, France.
Biomed Res Int. 2015;2015:639021. doi: 10.1155/2015/639021. Epub 2015 Jun 2.
The aim of this study was to provide a definition of big data in healthcare.
A systematic search of PubMed literature published until May 9, 2014, was conducted. We noted the number of statistical individuals (n) and the number of variables (p) for all papers describing a dataset. These papers were classified into fields of study. Characteristics attributed to big data by authors were also considered. Based on this analysis, a definition of big data was proposed.
A total of 196 papers were included. Big data can be defined as datasets with Log(n∗p) ≥ 7. Properties of big data are its great variety and high velocity. Big data raises challenges regarding veracity, every step of the workflow, the extraction of meaningful information, and the sharing of information. Big data requires new computational methods that optimize data management. Related concepts are data reuse, false knowledge discovery, and privacy issues.
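The volume criterion above can be illustrated with a minimal sketch. It assumes that Log denotes the base-10 logarithm, which the abstract does not state explicitly. The function name `is_big_data` and the `threshold` parameter are hypothetical and introduced only for illustration.

```python
import math


def is_big_data(n: int, p: int, threshold: float = 7.0) -> bool:
    """Check the volume criterion Log(n*p) >= threshold.

    n: number of statistical individuals (rows)
    p: number of variables (columns)
    Assumption: Log is taken as the base-10 logarithm.
    """
    if n <= 0 or p <= 0:
        raise ValueError("n and p must be positive integers")
    # Python integers have arbitrary precision, so n * p cannot overflow.
    return math.log10(n * p) >= threshold


# A registry of 5,000,000 patients with 20 variables: n * p = 10^8
print(is_big_data(5_000_000, 20))
# A cohort of 10,000 patients with 100 variables: n * p = 10^6
print(is_big_data(10_000, 100))
```

Under this reading, the threshold corresponds to a dataset holding at least ten million cells (n × p ≥ 10^7), whatever the split between individuals and variables.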
Big data is defined by volume. Big data should not be confused with data reuse: data can be big without being reused for another purpose, for example, in omics. Conversely, data can be reused without necessarily being big, for example, in the secondary use of Electronic Medical Records (EMR) data.