Lorenz Matthias W, Abdi Negin Ashtiani, Scheckenbach Frank, Pflug Anja, Bülbül Alpaslan, Catapano Alberico L, Agewall Stefan, Ezhov Marat, Bots Michiel L, Kiechl Stefan, Orth Andreas
Department of Neurology, University Clinic Frankfurt, Schleusenweg 2-16, D-60528, Frankfurt/Main, Germany.
Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences, Frankfurt/Main, Germany.
BMC Med Inform Decis Mak. 2017 Apr 13;17(1):40. doi: 10.1186/s12911-017-0429-1.
For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. by using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognising matching variables is particularly important: it makes it possible to build software that presents, for each target variable, a shortlist of source variables from which a user picks the matching one, with only a low risk that the correct source variable is missing from the list.
For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated.
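The idea of combining simple rules into one Boolean expression per target variable can be illustrated with a minimal sketch. It is not the authors' implementation: logic regression proper searches over logic trees (typically with simulated annealing), whereas this sketch simply enumerates single rules and pairwise AND/OR combinations and keeps the one with the fewest misclassifications. The rule names, the variable fields (`name`, `dtype`, `min`, `max`) and the toy construction data are all hypothetical.

```python
from itertools import combinations

# Hypothetical hand-written rules for a target variable "age".
# Each rule maps a source-variable description (a dict) to True/False.
RULES = {
    "name_has_age": lambda v: "age" in v["name"].lower(),
    "is_numeric": lambda v: v["dtype"] == "numeric",
    "range_0_120": lambda v: v["min"] >= 0 and v["max"] <= 120,
}


def candidates(rule_names):
    """Yield (label, predicate) for single rules and pairwise AND/OR combinations."""
    for n in rule_names:
        yield n, RULES[n]
    for a, b in combinations(rule_names, 2):
        # Default arguments bind a and b at definition time.
        yield f"({a} AND {b})", lambda v, a=a, b=b: RULES[a](v) and RULES[b](v)
        yield f"({a} OR {b})", lambda v, a=a, b=b: RULES[a](v) or RULES[b](v)


def misclassified(predicate, data):
    """Count source variables for which the predicate disagrees with the label."""
    return sum(predicate(v) != is_match for v, is_match in data)


def best_combination(data):
    """Exhaustive stand-in for the logic regression search (assumption)."""
    return min(candidates(list(RULES)), key=lambda c: misclassified(c[1], data))


# Toy construction subset: (source variable, does it match target "age"?)
construction = [
    ({"name": "AGE", "dtype": "numeric", "min": 18, "max": 90}, True),
    ({"name": "age_grp", "dtype": "text", "min": 0, "max": 0}, False),
    ({"name": "sbp", "dtype": "numeric", "min": 80, "max": 220}, False),
    ({"name": "patage", "dtype": "numeric", "min": 20, "max": 85}, True),
    ({"name": "bmi", "dtype": "numeric", "min": 15, "max": 60}, False),
]

label, predicate = best_combination(construction)
print(label, misclassified(predicate, construction))
```

In a setting like the one described, the selected expression would then be applied unchanged to a separate validation subset to check how well it generalises.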
In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less in 63% of all variables in the construction sample, and in 71% of all variables in the validation sample.
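For readers less familiar with the two metrics, the following sketch shows how PPV and NPV are computed from predicted and true matches. The function name and the toy vectors are illustrative assumptions and are not taken from the study data.

```python
def ppv_npv(predicted, actual):
    """Return (PPV, NPV) for paired boolean predictions and true labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)            # flagged, truly matching
    fp = sum(p and not a for p, a in pairs)        # flagged, not matching
    tn = sum(not p and not a for p, a in pairs)    # not flagged, not matching
    fn = sum(not p and a for p, a in pairs)        # not flagged, but matching
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Toy example: 8 source variables checked against one target variable.
predicted = [True, True, True, False, False, False, False, False]
actual = [True, False, False, False, False, False, False, True]

ppv, npv = ppv_npv(predicted, actual)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

A low PPV alongside a high NPV, as reported here, means that many flagged source variables are false alarms while unflagged ones are rarely true matches.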
We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.