Colineaux Helene, Lepage Benoit, Chauvin Pierre, Dimeglio Chloe, Delpierre Cyrille, Lefèvre Thomas
EQUITY Team, Centre d'Epidémiologie et de Recherche en Santé des POPulations (CERPOP), Institut National de la Santé et de la Recherche Médicale (INSERM)-Toulouse III University, 37 Allées Jules Guesde, 31062 Toulouse, France.
Epidemiology Department, Toulouse Teaching Hospital, 37 Allées Jules Guesde, 31062 Toulouse, France.
Int J Environ Res Public Health. 2025 Feb 27;22(3):348. doi: 10.3390/ijerph22030348.
Epidemiologists often handle large datasets with numerous variables and now have access to a growing range of data analysis techniques, such as machine learning. Critical aspects involve addressing causality, often based on observational data, and dealing with the complex relationships between variables to uncover the overall structure of variable interactions, causal or not. Structure learning (SL) methods aim to automatically or semi-automatically reveal the structure of variables' relationships. The objective of this study is to delineate some of the potential contributions and limitations of structure learning methods when applied to social epidemiology topics and the search for determinants of healthcare system access. We applied SL techniques to a real-world dataset, namely the 2010 wave of the SIRS cohort, which included a sample of 3006 adults from the Paris region, France. Healthcare utilization, encompassing both direct and indirect access to care, was the primary outcome. Candidate determinants included health status, demographic characteristics, and socio-cultural and economic positions. We present two approaches: a non-automated epidemiological method (an initial expert knowledge network and stepwise logistic regression models) and an SL approach comprising three techniques using various algorithms, with and without knowledge constraints. We compared the results based on the presence, direction, and strength of specific links within the produced network. Although the interdependencies and relative strengths identified by both approaches were similar, the SL algorithms detected fewer associations with the outcome than the non-automated method. Relationships between variables were sometimes incorrectly oriented when using a purely data-driven approach. SL algorithms can be valuable in exploratory stages, helping to generate new hypotheses or mine novel databases. However, results should be validated against prior knowledge and supplemented with additional confirmatory analyses.
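To make the idea of structure learning "with and without knowledge constraints" concrete, the following is a minimal sketch, not the authors' method: the abstract does not name the three SL techniques used, so a generic score-based greedy search (arc additions and deletions scored by BIC) stands in for them. The variable names (`older`, `poor_health`, `care_use`), the simulated data, and the helper functions are all hypothetical and unrelated to the SIRS cohort. Knowledge constraints are represented as a set of forbidden arcs.

```python
import math
import random
from collections import Counter
from itertools import permutations


def bic_node(data, child, parents):
    """BIC contribution of one discrete node given its parent set."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    levels = len({r[child] for r in data})
    # Free parameters, counted over the parent configurations seen in the data.
    n_params = (levels - 1) * len(marg)
    return loglik - 0.5 * math.log(n) * n_params


def creates_cycle(parents, src, dst):
    """True if adding src -> dst would close a directed cycle."""
    stack, seen = [src], set()
    while stack:  # walk the ancestors of src, looking for dst
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def hill_climb(data, variables, forbidden=frozenset()):
    """Greedy search over arc additions/deletions; returns a set of (parent, child)."""
    parents = {v: set() for v in variables}
    score = {v: bic_node(data, v, []) for v in variables}
    while True:
        best_gain, best_move = 1e-9, None
        for src, dst in permutations(variables, 2):
            if src in parents[dst]:
                candidate = parents[dst] - {src}  # try deleting the arc
            elif (src, dst) in forbidden or creates_cycle(parents, src, dst):
                continue  # excluded by prior knowledge or by acyclicity
            else:
                candidate = parents[dst] | {src}  # try adding the arc
            gain = bic_node(data, dst, sorted(candidate)) - score[dst]
            if gain > best_gain:
                best_gain, best_move = gain, (dst, candidate)
        if best_move is None:  # no move improves the score: stop
            return {(p, c) for c, ps in parents.items() for p in ps}
        dst, candidate = best_move
        parents[dst] = candidate
        score[dst] = bic_node(data, dst, sorted(candidate))


def simulate(n=3000, seed=0):
    """Toy data with a known chain: older -> poor_health -> care_use."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        older = int(rng.random() < 0.5)
        poor_health = int(rng.random() < (0.7 if older else 0.2))
        care_use = int(rng.random() < (0.8 if poor_health else 0.3))
        rows.append({"older": older, "poor_health": poor_health, "care_use": care_use})
    return rows


variables = ["older", "poor_health", "care_use"]
data = simulate()

# Assumed prior knowledge: nothing causes age, and the outcome causes nothing.
forbidden = {
    ("poor_health", "older"), ("care_use", "older"),
    ("care_use", "poor_health"),
}

print("data-driven only:", sorted(hill_climb(data, variables)))
print("with constraints:", sorted(hill_climb(data, variables, forbidden)))
```

Without constraints, the search cannot distinguish between arc directions that fit the data equally well, so the orientation it returns may contradict what is known about the variables. Supplying forbidden arcs is one simple way to encode expert knowledge; a fuller implementation would also consider arc reversals and required arcs.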