Department of Community Health Sciences, University of Calgary, Calgary, Canada.
Applied Research and Evaluation - Primary Health Care, Alberta Health Services, Calgary, Canada.
Int J Popul Data Sci. 2021 Nov 30;6(1):1680. doi: 10.23889/ijpds.v6i1.1680. eCollection 2021.
Pooling data from pre-existing datasets can increase the sample size and statistical power available to answer a research question. However, individual datasets may contain variables that measure the same construct differently, which poses challenges for data pooling. Variable harmonization, an approach that generates comparable datasets from heterogeneous sources, can address this issue in some circumstances. As an illustrative example, this paper describes the data harmonization strategies that helped generate comparable datasets across two Canadian pregnancy cohort studies: All Our Families and Alberta Pregnancy Outcomes and Nutrition. Variables were harmonized by considering multiple features across the datasets: the construct measured; the question asked and its response options; the measurement scale used; the frequency of measurement; the timing of measurement; and the data structure. Based on these features, variables were classified as completely matching, partially matching, or completely unmatching across the datasets. Variables that were an exact match were pooled as is. Partially matching variables were harmonized, that is, processed into a common format across the datasets, taking into account the frequency of measurement, the timing of measurement, the measurement scale used, and the response options. Variables that were completely unmatching could not be harmonized into a single variable. The variable harmonization strategies used to generate comparable cohort datasets for data pooling are applicable to other data sources. Future studies may employ or evaluate these strategies, which allow researchers to answer novel research questions in a statistically efficient, timely, and cost-effective manner that could not be achieved using a single data source.
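The classification and pooling steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `VariableSpec` class, the `classify_match` and `recode` helpers, the smoking variable, and all codes and values are hypothetical and chosen only for illustration. The sketch assumes that a differing construct makes two variables unmatching, that agreement on every feature makes them an exact match, and that anything in between is a partial match to be recoded into a common format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VariableSpec:
    """Features compared across datasets (hypothetical representation)."""
    construct: str
    response_options: tuple
    scale: str
    frequency: str
    timing: str
    structure: str


def classify_match(a: VariableSpec, b: VariableSpec) -> str:
    """Return 'exact', 'partial', or 'unmatched' (assumed decision rule)."""
    if a.construct != b.construct:
        return "unmatched"
    differing = [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]
    return "exact" if not differing else "partial"


def recode(values, mapping):
    """Map cohort-specific codes to common codes; unmapped values become None."""
    return [mapping.get(v) for v in values]


# Hypothetical variable definitions for the same construct in two cohorts.
smoking_a = VariableSpec("smoking in pregnancy", ("yes", "no"),
                         "binary", "once", "second trimester", "wide")
smoking_b = VariableSpec("smoking in pregnancy", ("never", "occasionally", "daily"),
                         "ordinal", "once", "second trimester", "wide")

status = classify_match(smoking_a, smoking_b)

# A partial match is recoded to a common binary format before pooling.
cohort_a = recode(["yes", "no", "no"], {"yes": 1, "no": 0})
cohort_b = recode(["never", "daily", "occasionally"],
                  {"never": 0, "occasionally": 1, "daily": 1})

# Keep the source cohort alongside each harmonized value.
pooled = [("A", v) for v in cohort_a] + [("B", v) for v in cohort_b]
```

Collapsing the three-level response to a binary one loses detail from the second cohort, which is the usual cost of harmonizing to the coarser of two scales; keeping the cohort label with each pooled value lets later analyses adjust for the source.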