Bias in Odds Ratios From Logistic Regression Methods With Sparse Data Sets.

Author information

Department of Biostatistics, Faculty of Medicine, University of Tsukuba.

Graduate School of Comprehensive Human Sciences, University of Tsukuba.

Publication information

J Epidemiol. 2023 Jun 5;33(6):265-275. doi: 10.2188/jea.JE20210089. Epub 2022 Apr 1.

DOI: 10.2188/jea.JE20210089

PMID: 34565762

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10165217/

Abstract

BACKGROUND

Logistic regression models are widely used to evaluate the association between a binary outcome and a set of covariates. However, when few study participants fall into some combinations of outcome and covariate levels, the odds ratio (OR) estimated by the maximum likelihood (ML) method is biased. This bias is known as sparse data bias, and it can produce implausibly large OR estimates. Despite this, it has been ignored in most epidemiological studies.
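To make the sparse data bias described above concrete, here is a minimal simulation sketch that is not taken from the paper. It assumes a single binary exposure, two equally sized groups, and hypothetical settings (baseline risk 0.1, true OR 2); the function name `simulate_ml_or` and all parameter values are illustrative assumptions. For a 2x2 table the ML estimate of the OR is the cross-product ratio ad/(bc), which is zero, infinite, or undefined when a cell is empty.

```python
import random


def simulate_ml_or(n_per_group, p0, true_or, n_sims, seed=0):
    """Simulate 2x2 tables; return (mean finite ML OR, share of tables with a zero cell)."""
    rng = random.Random(seed)
    odds1 = true_or * p0 / (1.0 - p0)      # odds of the outcome among the exposed
    p1 = odds1 / (1.0 + odds1)             # risk among the exposed implied by true_or
    finite_ors = []
    zero_cell_tables = 0
    for _ in range(n_sims):
        a = sum(rng.random() < p1 for _ in range(n_per_group))  # exposed, event
        c = sum(rng.random() < p0 for _ in range(n_per_group))  # unexposed, event
        b, d = n_per_group - a, n_per_group - c                 # non-events
        if min(a, b, c, d) == 0:
            zero_cell_tables += 1          # ML OR is 0, infinite, or undefined
            continue
        finite_ors.append((a * d) / (b * c))
    return sum(finite_ors) / len(finite_ors), zero_cell_tables / n_sims


# Hypothetical settings: same true OR, very different amounts of data.
sparse_mean, sparse_zero = simulate_ml_or(20, 0.1, 2.0, n_sims=5000)
large_mean, large_zero = simulate_ml_or(1000, 0.1, 2.0, n_sims=300)
print(f"sparse: mean finite OR = {sparse_mean:.2f}, zero-cell share = {sparse_zero:.2%}")
print(f"large:  mean finite OR = {large_mean:.2f}, zero-cell share = {large_zero:.2%}")
```

Under these assumptions the average of the finite estimates from the sparse tables sits well above the true value of 2, and a non-trivial share of tables has no finite ML estimate at all, while the large-sample average stays close to 2.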

METHODS

We review several methods for reducing sparse data bias in logistic regression. The primary aim is to compare Bayesian methods with classical methods (the ML, Firth's, and exact methods) in a simulation study. We also apply these methods to a real data set.
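As a small illustration of one of the classical methods named above, the sketch below contrasts the ML estimate with Firth's penalized estimate in the simplest possible setting. It relies on a known special case rather than the general algorithm: with one binary exposure and no other covariates, maximizing the Jeffreys-prior-penalized likelihood amounts to adding 0.5 to each cell of the 2x2 table. The function names and the example counts are hypothetical and are not from the paper.

```python
import math


def ml_log_or(a, b, c, d):
    """ML log OR for a 2x2 table (a, b = exposed events/non-events; c, d = unexposed)."""
    if a * d == 0 and b * c == 0:
        return math.nan                    # 0/0: no information about the OR
    if b * c == 0:
        return math.inf                    # separation: estimate diverges upward
    if a * d == 0:
        return -math.inf                   # separation: estimate diverges downward
    return math.log((a * d) / (b * c))


def firth_log_or_2x2(a, b, c, d):
    """Firth-type log OR, valid only for the single-binary-covariate special case."""
    return math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))


# Hypothetical sparse table: no events among the unexposed.
sparse = (4, 6, 0, 10)
# Hypothetical large table with the same layout.
large = (300, 700, 200, 800)
for name, table in (("sparse", sparse), ("large", large)):
    print(name, ml_log_or(*table), firth_log_or_2x2(*table))
```

The penalized estimate stays finite where the ML estimate diverges, and the two nearly coincide on the large table. Removing separation in this way does not by itself guarantee small bias.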

RESULTS

Our simulation results indicate that the bias of the OR from the ML, Firth's, and exact methods is considerable. Furthermore, the Bayesian methods with hyper-g prior modeling of the prior covariance matrix for regression coefficients reduced the bias under the null hypothesis, whereas the Bayesian methods with log F-type priors reduced the bias under the alternative hypothesis.
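The log F-type priors mentioned above can be illustrated with a penalized-likelihood sketch. A log-F(m, m) prior on a log odds ratio beta has density proportional to exp(m*beta/2) / (1 + exp(beta))^m, so the posterior mode maximizes the log-likelihood plus (m/2)*beta - m*log(1 + exp(beta)). The code below does this for one binary exposure with an unpenalized intercept, using Newton's method with step halving. It is an illustration under stated assumptions (hypothetical function names, m = 2, made-up counts, posterior mode rather than a full posterior summary), not the authors' implementation, and it does not cover the hyper-g prior.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


def fit_logf_2x2(a, b, c, d, m=2.0, max_iter=100, tol=1e-9):
    """Posterior-mode (alpha, beta) for logit P(event) = alpha + beta * exposed,
    with a log-F(m, m) prior on beta and a flat prior on alpha."""
    n1, n0 = a + b, c + d                  # exposed and unexposed group sizes

    def objective(al, be):
        loglik = (c * al - n0 * log1pexp(al)
                  + a * (al + be) - n1 * log1pexp(al + be))
        return loglik + 0.5 * m * be - m * log1pexp(be)   # add log-prior on beta

    alpha = beta = 0.0
    for _ in range(max_iter):
        s0, s1, sb = sigmoid(alpha), sigmoid(alpha + beta), sigmoid(beta)
        g_a = (a + c) - n0 * s0 - n1 * s1                 # d objective / d alpha
        g_b = a - n1 * s1 + 0.5 * m - m * sb              # d objective / d beta
        if max(abs(g_a), abs(g_b)) < tol:
            break
        w0, w1, wb = n0 * s0 * (1 - s0), n1 * s1 * (1 - s1), m * sb * (1 - sb)
        h_aa, h_ab, h_bb = w0 + w1, w1, w1 + wb           # negative Hessian entries
        det = h_aa * h_bb - h_ab ** 2
        step_a = (h_bb * g_a - h_ab * g_b) / det          # Newton direction
        step_b = (h_aa * g_b - h_ab * g_a) / det
        t, current = 1.0, objective(alpha, beta)
        slack = 1e-9 * (1.0 + abs(current))               # tolerate rounding noise
        while objective(alpha + t * step_a, beta + t * step_b) < current - slack and t > 1e-8:
            t *= 0.5                                      # halve step if objective drops
        alpha += t * step_a
        beta += t * step_b
    return alpha, beta


# Hypothetical tables: one with an empty cell, one large.
_, beta_sparse = fit_logf_2x2(4, 6, 0, 10)
_, beta_large = fit_logf_2x2(300, 700, 200, 800)
print(math.exp(beta_sparse), math.exp(beta_large))
```

With the empty-cell table the ML estimate is infinite, whereas the prior pulls the estimate to a finite OR; with the large table the prior has almost no influence. Larger values of m shrink beta more strongly toward zero, so the choice of m is itself a candidate for sensitivity analysis.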

CONCLUSION

The Bayesian methods using log F-type priors and the hyper-g prior are superior to the ML, Firth's, and exact methods when fitting logistic models to sparse data sets. Which method is preferable depends on whether the null or the alternative hypothesis holds. Sensitivity analysis is important for understanding the robustness of the results in sparse data analysis.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b94a/10165217/42de3b017054/je-33-265-g001.jpg
