Dey Devjit, Haque Md Samio, Islam Md Mojahedul, Aishi Umme Iffat, Shammy Sajida Sultana, Mayen Md Sabbir Ahmed, Noor Syed Toukir Ahmed, Uddin Md Jamal
Department of Statistics, Shahjalal University of Science and Technology, Sylhet, 3114, Bangladesh.
Maternal and Child Health Division, International Centre for Diarrhoeal Disease Research, Bangladesh (icddr,b), Dhaka, Bangladesh.
BMC Med Res Methodol. 2025 Jan 22;25(1):15. doi: 10.1186/s12874-024-02454-5.
Logistic regression is widely used in fields such as healthcare, marketing, and finance to model binary outcomes (e.g., sick vs. not sick). However, when it is applied to data from surveys with complex sampling designs, specific methodological issues are often overlooked.
This systematic review searched the PubMed and ScienceDirect databases for articles published from January 2015 to December 2021, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines, and focused primarily on studies using the Demographic and Health Surveys (DHS) and Multiple Indicator Cluster Surveys (MICS). A total of 810 articles met the inclusion criteria and were included in the analysis. The review assessed several methodological aspects of logistic regression: model adequacy assessment, handling of dependence among observations, use of the complex survey design, and treatment of missing values and outliers, among others.
Among the selected articles, DHS data were used most often (96%), with MICS alone accounting for 3% and DHS and MICS together for 1%. Only 19.7% of the studies employed multilevel mixed-effects logistic regression to account for data dependencies. Model validation techniques were not reported in 94.8% of the studies, and the bootstrap, jackknife, and other resampling methods were rarely used. Sample weights, primary sampling units (PSUs), and strata variables were used together in 40.4% of the articles, while 41.7% of the studies used none of these variables, which could have produced biased results. Goodness-of-fit assessments were not mentioned in 75.3% of the articles; among those that reported one, the Hosmer-Lemeshow test and the likelihood ratio test were the most common. Furthermore, 95.8% of the studies did not mention outliers, only 41.0% corrected for missing information, and only 2.7% applied imputation techniques.
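To make concrete what using sample weights, PSUs, and strata together means, the following is a minimal sketch and not a method taken from the reviewed studies. It fits a weighted logistic regression with one binary covariate by Newton-Raphson and estimates the standard error of the slope with a stratified delete-one-PSU jackknife. The toy data, the function names, and the choice of the jackknife over linearization are all assumptions made for illustration.

```python
import math

def fit_weighted_logit(x, y, w, iters=25):
    """Weighted logistic regression (intercept + one covariate), Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)           # weighted score contribution
            v = wi * p * (1.0 - p)      # weighted information contribution
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g for the 2x2 information matrix H
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def toy_rows():
    """Hypothetical data: rows of (stratum, psu, weight, x, y)."""
    spec = [  # stratum, psu, weight, counts of (x, y) = (0,0), (0,1), (1,0), (1,1)
        ("A", 1, 1.5, (5, 2, 3, 4)),
        ("A", 2, 1.5, (4, 3, 2, 5)),
        ("A", 3, 1.5, (6, 1, 3, 3)),
        ("B", 4, 4.0, (3, 3, 2, 6)),
        ("B", 5, 4.0, (4, 2, 1, 5)),
    ]
    rows = []
    for stratum, psu, weight, counts in spec:
        for (x, y), n in zip([(0, 0), (0, 1), (1, 0), (1, 1)], counts):
            rows += [(stratum, psu, weight, x, y)] * n
    return rows

def slope(rows):
    _, _, w, x, y = zip(*rows)
    return fit_weighted_logit(x, y, w)[1]

def jackknife_slope_se(rows):
    """Delete one PSU at a time within each stratum (needs >= 2 PSUs per stratum)."""
    full = slope(rows)
    psus = {}
    for stratum, psu, *_ in rows:
        psus.setdefault(stratum, set()).add(psu)
    variance = 0.0
    for stratum, members in psus.items():
        n_h = len(members)
        scale = n_h / (n_h - 1)  # reweight the PSUs that remain in this stratum
        for dropped in sorted(members):
            replicate = [
                (s, p, w * scale if s == stratum else w, x, y)
                for s, p, w, x, y in rows
                if not (s == stratum and p == dropped)
            ]
            variance += (n_h - 1) / n_h * (slope(replicate) - full) ** 2
    return full, math.sqrt(variance)

estimate, se = jackknife_slope_se(toy_rows())
print(f"log odds ratio = {estimate:.3f}, jackknife SE = {se:.3f}")
```

The weights affect the point estimate, while the PSUs and strata enter only through the variance. In practice, dedicated survey software would normally be used instead of hand-written code like this.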
This systematic review highlights important gaps in the use of logistic regression with complex survey data: data dependencies, survey design, and proper validation techniques are often overlooked, and outliers, missing data, and goodness-of-fit assessments are frequently neglected. These gaps point to the need for clearer methodological standards and more thorough reporting to improve the reliability of results. Future research should follow such standards consistently to ensure more dependable findings.
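As an illustration of the kind of goodness-of-fit check whose absence is noted above, the Hosmer-Lemeshow statistic can be sketched as follows. This is a plain, unweighted version that ignores the sampling design. The function names, the equal-size grouping rule, and the closed-form chi-square tail (valid only for even degrees of freedom) are assumptions chosen to keep the example dependency-free.

```python
import math

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, df) comparing observed and expected events
    across groups of observations sorted by predicted probability."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        expected = sum(pi for pi, _ in chunk)   # sum of predicted probabilities
        observed = sum(yi for _, yi in chunk)   # number of events
        pbar = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * pbar * (1.0 - pbar))
    return stat, groups - 2

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles positive even df only")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Hypothetical predictions: four groups of ten with predicted risk
# 0.2, 0.4, 0.6, 0.8; the lowest-risk group has 5 events instead of 2.
p = [0.2] * 10 + [0.4] * 10 + [0.6] * 10 + [0.8] * 10
y = [1] * 5 + [0] * 5 + [1] * 4 + [0] * 6 + [1] * 6 + [0] * 4 + [1] * 8 + [0] * 2
stat, df = hosmer_lemeshow(y, p, groups=4)
print(f"HL statistic = {stat:.3f} on {df} df, p = {chi2_sf_even_df(stat, df):.3f}")
```

A small p-value suggests that predicted and observed event counts disagree. The sketch assumes at least as many observations as groups and no group whose mean predicted probability is exactly 0 or 1.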