Automatic identification of variables in epidemiological datasets using logic regression.

Author information

Lorenz Matthias W, Abdi Negin Ashtiani, Scheckenbach Frank, Pflug Anja, Bülbül Alpaslan, Catapano Alberico L, Agewall Stefan, Ezhov Marat, Bots Michiel L, Kiechl Stefan, Orth Andreas

Affiliations

Department of Neurology, University Clinic Frankfurt, Schleusenweg 2-16, D-60528, Frankfurt/Main, Germany.

Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences, Frankfurt/Main, Germany.

Publication information

BMC Med Inform Decis Mak. 2017 Apr 13;17(1):40. doi: 10.1186/s12911-017-0429-1.

DOI:10.1186/s12911-017-0429-1

PMID:28407816

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC5390441/

Abstract

BACKGROUND

For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognizing matching variables is particularly important: it allows software to present, for each target variable, a list of candidate source variables from which a user chooses the matching one, with only a low risk that the correct source variable is missing from the list.

METHODS

For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated.
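The idea of searching for a Boolean combination of hand-written rules can be illustrated with a minimal Python sketch. It is not the authors' implementation: logic regression as usually described searches over logic trees with a stochastic procedure such as simulated annealing, whereas this sketch simply enumerates single rules, their negations, and AND/OR pairs. The rule names (`name_has_age`, `is_numeric`), the toy data, and the accuracy score are assumptions made for illustration only.

```python
from itertools import combinations

def accuracy(pred, truth):
    """Fraction of candidate source variables classified correctly."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def best_boolean_combination(rules, truth):
    """Exhaustively search single literals and AND/OR pairs of literals.

    rules: dict mapping a (hypothetical) rule name to a list of booleans,
           one entry per candidate source variable.
    truth: list of booleans, True where the candidate really matches
           the target variable.
    Returns (expression, score) for the best-scoring combination.
    """
    # Each rule and its negation is a candidate literal.
    literals = {}
    for name, values in rules.items():
        literals[name] = list(values)
        literals["not " + name] = [not v for v in values]

    best_expr, best_score = None, -1.0

    # Single literals.
    for name, values in literals.items():
        s = accuracy(values, truth)
        if s > best_score:
            best_expr, best_score = name, s

    # Pairs of literals joined by AND or OR.
    for (a, va), (b, vb) in combinations(literals.items(), 2):
        for op in ("and", "or"):
            if op == "and":
                pred = [x and y for x, y in zip(va, vb)]
            else:
                pred = [x or y for x, y in zip(va, vb)]
            s = accuracy(pred, truth)
            if s > best_score:
                best_expr, best_score = f"({a} {op} {b})", s

    return best_expr, best_score

# Toy construction subset: six candidate source variables for the
# target variable "age" (all values are invented for illustration).
rules = {
    "name_has_age": [True, True, False, False, True, False],
    "is_numeric":   [True, False, True, True, True, False],
}
truth = [True, False, False, False, True, False]

expression, score = best_boolean_combination(rules, truth)
print(expression, score)
```

In a setting like the one described, the combination found on the construction subset would then be applied unchanged to the validation subset to check how well it generalizes.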

RESULTS

In the construction sample, the 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less in 63% of all variables in the construction sample and in 71% of all variables in the validation sample.
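For readers less familiar with these measures, the following Python sketch shows how PPV and NPV are computed from predicted and true matches for one target variable. The function name and the toy data are assumptions for illustration and do not reproduce the study's figures.

```python
def predictive_values(predicted, actual):
    """Return (PPV, NPV) for paired lists of booleans.

    predicted: True where the rule combination flags a source variable
               as matching the target variable.
    actual:    True where the source variable really matches.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives

    # PPV = TP / (TP + FP); NPV = TN / (TN + FN).
    # NaN is returned when a denominator is zero.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv

# Invented example: eight candidate source variables.
predicted = [True, True, True, False, False, False, False, False]
actual    = [True, False, False, False, False, False, False, True]

ppv, npv = predictive_values(predicted, actual)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

A low PPV combined with a high NPV, as reported here, means that many flagged candidates are false matches while candidates that are not flagged are rarely true matches.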

CONCLUSIONS

We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7998/5390441/dc11e6e0d8dd/12911_2017_429_Fig1_HTML.jpg
