Department of Basic Medical Sciences, Faculty of Medical Sciences Teaching and Research Complex, The University of the West Indies, Mona, Kingston 7, Jamaica.
Department of Computing, The University of the West Indies, Mona, Kingston, Jamaica.
Behav Res Methods. 2023 Jun;55(4):1818-1838. doi: 10.3758/s13428-022-01895-4. Epub 2022 Jun 29.
The characteristics of big data, including high volume, increased variety, and velocity, pose special challenges for data analysis. Because these characteristics generally preclude manual data inspection and processing, researchers must often rely on computational methodologies to handle this type of data, and these techniques may be unfamiliar to nonspecialists, including behavioral scientists. However, existing data analytics methodologies from the field of computer science, developed to handle the generic tasks of data collection, preprocessing, and analysis, can be adapted for use in other disciplines. These methodologies involve a sequential pipeline of quality checks that prepare data sets for analysis and application. Building upon these methodologies, this paper describes the Big Data Quality & Statistical Assurance (BDQSA) model, intended for researchers in the behavioral sciences. It involves a series of data preprocessing tasks covering data understanding, screening, cleaning, and transformation. These are followed by a statistical quality phase, which includes extraction of the relevant data subset, type conversions, ensuring sample representativeness when appropriate, and assessing statistical assumptions. The model thereby provides methodological guidance for preprocessing behavioral science big data, with the aim of ensuring acceptable data quality before analysis is undertaken. Sample R code snippets demonstrating the application of this model are provided throughout the paper.
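As a rough illustration of the kind of sequential pipeline the abstract outlines, the minimal sketch below walks a toy data set through data understanding, cleaning, type conversion, subset extraction, and one simple distributional check. It is not the paper's R code: it is written in Python, and the records, field names, the adult-only subset rule, and the use of skewness as a stand-in for assessing statistical assumptions are all assumptions made for illustration. Sample representativeness is not covered.

```python
from statistics import mean, pstdev

# Hypothetical raw records; all field names and values are invented for illustration.
RAW = [
    {"id": "1", "age": "23", "score": "4.0"},
    {"id": "2", "age": "", "score": "3.5"},    # missing age
    {"id": "3", "age": "31", "score": "5.0"},
    {"id": "3", "age": "31", "score": "5.0"},  # exact duplicate
    {"id": "4", "age": "45", "score": "3.0"},
    {"id": "5", "age": "17", "score": "4.5"},  # outside the assumed target subset (adults)
]

def missing_counts(rows):
    """Data understanding: count empty values per field."""
    return {f: sum(1 for r in rows if r[f] == "") for f in rows[0]}

def clean(rows):
    """Screening and cleaning: drop exact duplicates and incomplete rows."""
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen or any(v == "" for v in r.values()):
            continue
        seen.add(key)
        out.append(r)
    return out

def convert(rows):
    """Type conversion: parse string fields into numeric types."""
    return [
        {"id": int(r["id"]), "age": int(r["age"]), "score": float(r["score"])}
        for r in rows
    ]

def skewness(xs):
    """Population skewness, used here as a crude screen for asymmetry."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

summary = missing_counts(RAW)                   # data understanding
typed = convert(clean(RAW))                     # cleaning, then type conversion
adults = [r for r in typed if r["age"] >= 18]   # extract the relevant subset
score_skew = skewness([r["score"] for r in adults])
```

Each stage is a separate function so that the intermediate results (`summary`, `typed`, `adults`) can be inspected before moving on, which mirrors the idea of running quality checks in sequence rather than in a single opaque step.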