Department of Neurology, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands.
Department of Neurology, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands.
PLoS One. 2020 May 12;15(5):e0232970. doi: 10.1371/journal.pone.0232970. eCollection 2020.
Pooling individual participant data across studies is often complicated by diversity in variables between the available datasets. Recoding original variables is therefore often necessary to build a pooled dataset. We aimed to quantify how much information is lost in this process and to what extent this loss jeopardizes the validity of analysis results.
Data were derived from a platform developed to pool data from three randomized controlled trials on the effect of treating cardiovascular risk factors on cognitive decline or dementia. We quantified loss of information using the R-squared of linear regression models that express each pooled variable as a function of its original variable(s). Where the R-squared was below 0.8, we additionally explored the potential impact of this loss of information on future analyses: we tested whether the beta coefficient of a predictor changed by more than 10% when the original versus the recoded variable was added as a confounder in a linear regression model. In a simulation, we randomly sampled numbers, recoded those ≤1000 to 0 and those >1000 to 1, varied the range of the continuous variable, the ratio of recoded zeroes to recoded ones, or both, and again extracted the R-squared from linear models to quantify information loss.
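The simulation described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code; the sample size, seed, and use of an ordinary least-squares fit via `np.polyfit` are assumptions, since the abstract does not specify the implementation.

```python
import numpy as np

def r2_after_recoding(low, high, n=10_000, cutoff=1000, seed=0):
    """Simulate information loss when a continuous variable is
    dichotomized at `cutoff` (values <= cutoff -> 0, > cutoff -> 1).

    Returns the R-squared of a linear regression of the recoded
    variable on the original one, mirroring the paper's approach of
    modeling the pooled variable as a function of the original."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, n)        # original continuous variable
    y = (x > cutoff).astype(float)       # recoded binary variable
    slope, intercept = np.polyfit(x, y, 1)   # OLS fit y ~ x
    resid = y - (slope * x + intercept)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Varying the sampling range changes the ratio of recoded zeroes to ones: `r2_after_recoding(0, 2000)` yields a balanced 1:1 split and a relatively high R-squared, whereas `r2_after_recoding(0, 10000)` yields a skewed split and a markedly lower R-squared.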
The R-squared was below 0.8 for 8 of 91 recoded variables. In 4 of these cases, the loss of information had a substantial impact on the regression models, particularly when a continuous variable had been recoded into a discrete one. Our simulation showed that the least information is lost when the ratio of recoded zeroes to ones is 1:1.
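The confounder sensitivity check behind these results can be sketched as below. This is a hypothetical illustration, not the authors' implementation: the variable names and the simulated data in the usage example are assumptions, and only the 10% criterion on the predictor's beta coefficient comes from the paper.

```python
import numpy as np

def beta_shift(y, pred, conf_orig, conf_recoded):
    """Fit y ~ predictor + confounder twice, once with the original
    confounder and once with its recoded version, and return the
    relative change in the predictor's beta coefficient.

    A shift above 0.10 (10%) flags a substantial impact of recoding,
    following the criterion used in the paper."""
    def beta_of_pred(conf):
        X = np.column_stack([np.ones_like(pred), pred, conf])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coef[1]                      # coefficient of the predictor
    b_orig = beta_of_pred(conf_orig)
    b_recoded = beta_of_pred(conf_recoded)
    return abs(b_recoded - b_orig) / abs(b_orig)
```

For example, simulating an outcome that depends on both a predictor and a continuous confounder, then dichotomizing the confounder at 1000, leaves residual confounding that shifts the predictor's beta well past the 10% threshold.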
Large, pooled datasets provide great opportunities, justifying the effort of data harmonization. Still, caution is warranted when using recoded variables whose variance is only partly explained by their original variables, as this may jeopardize the validity of study results.