Peek N, Holmes J H, Sun J
Niels Peek, Centre for Health Informatics, The University of Manchester, Vaughan House, Portsmouth Street, Manchester M13 9GB, United Kingdom
Yearb Med Inform. 2014 Aug 15;9(1):42-7. doi: 10.15265/IY-2014-0018.
To review technical and methodological challenges for big data research in biomedicine and health.
We discuss sources of big datasets, survey infrastructures for big data storage and big data processing, and describe the main challenges that arise when analyzing big data.
The life and biomedical sciences contribute substantially to the big data revolution, both through secondary use of data collected during routine care and through new data sources such as social media. Efficient processing of big datasets is typically achieved by distributing computation over a cluster of computers. Data analysts should be aware of pitfalls specific to big data, such as bias in routine care data and the risk of false-positive findings in high-dimensional datasets.
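The risk of false-positive findings in high-dimensional data can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the review: it assumes independent tests, all with true null hypotheses, and a conventional significance level of 0.05. The function names are hypothetical, and the Bonferroni correction is shown as one common safeguard, not as the authors' recommendation.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive among `num_tests`
    independent tests when every null hypothesis is true (assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance threshold that keeps the family-wise
    error rate at or below `alpha` (Bonferroni correction)."""
    return alpha / num_tests


if __name__ == "__main__":
    # The chance of a spurious "finding" grows quickly with the number of tests.
    for m in (1, 10, 100, 10_000):
        print(m, familywise_error_rate(m), bonferroni_threshold(m))
```

With a single test the chance of a false positive equals the significance level, but with 100 independent tests it already exceeds 99%, which is why uncorrected screening of thousands of variables is likely to yield spurious associations.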
The major challenge for the near future is to adapt the analytical methods used in the biomedical and health domain to the distributed storage and processing model required to handle big data, while ensuring the confidentiality of the data being analyzed.
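As a rough illustration of what fitting an analytical method to a distributed processing model can involve, the sketch below computes a mean and variance from per-partition summaries instead of from the pooled records. It is a minimal single-machine sketch under stated assumptions, not a method described in the review: the partitions stand in for data held on separate cluster nodes, the function names are hypothetical, and the sum-of-squares formula is chosen for brevity even though it is numerically fragile for large values.

```python
from functools import reduce


def summarize(partition):
    """Map step: reduce one partition to (count, sum, sum of squares).
    On a real cluster this would run on the node holding the data."""
    return (len(partition), sum(partition), sum(x * x for x in partition))


def combine(a, b):
    """Reduce step: merge two partition summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(partitions):
    """Population mean and variance computed from summaries only."""
    n, s, ss = reduce(combine, map(summarize, partitions), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean ** 2
    return mean, variance


if __name__ == "__main__":
    # Three hypothetical partitions of unequal size.
    partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
    print(mean_and_variance(partitions))
```

Because only three aggregate numbers leave each partition, individual records never need to be pooled in one place. This narrows what is shared but does not by itself guarantee confidentiality, since summaries of very small partitions can still reveal individual values.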