Univ Angers, CHU Angers, Univ Rennes, Inserm, EHESP, Irset (Institut de recherche en santé, environnement et travail) - UMR_S, IRSET-ESTER, SFR ICAT, CAPTV CDC, Angers, France.
Octopize, Nantes, France.
PLoS One. 2024 Jul 31;19(7):e0308063. doi: 10.1371/journal.pone.0308063. eCollection 2024.
Although the rise of big data in occupational health offers new opportunities, especially for cross-cutting research, it raises issues of data privacy and security, particularly when linking sensitive data from insurance, occupational health, or compensation claims. We aimed to validate a large, blinded synthetic database developed from the CONSTANCES cohort by comparing associations between three independently selected outcomes and various exposures.
From the CONSTANCES cohort, a large synthetic dataset was constructed using the avatar method (Octopize), which is agnostic to the primary or secondary uses of the data. Three main analyses of interest were chosen to compare associations between the raw and avatar datasets: risk of stroke (any stroke, and subtypes of stroke), risk of knee pain, and limitations associated with knee pain. Logistic models were computed, and a qualitative comparison of paired odds ratios (ORs) was made.
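As an illustration of the quantity being paired, the sketch below computes an unadjusted odds ratio with a Wald 95% confidence interval from a 2x2 table, which matches what a logistic model with a single binary exposure would estimate. The function names and the counts are hypothetical and are not taken from the study; the study's models may have included covariates that this sketch ignores.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    All four counts are assumed to be strictly positive.
    """
    or_value = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se_log_or)
    upper = math.exp(math.log(or_value) + z * se_log_or)
    return or_value, lower, upper


def is_significant(lower, upper):
    """An association is called significant when the interval excludes 1."""
    return lower > 1.0 or upper < 1.0


# Hypothetical counts, for illustration only.
or_value, lower, upper = odds_ratio_wald(20, 80, 10, 90)
significant = is_significant(lower, upper)
```

Running the same computation on the raw and on the avatar dataset yields one raw/avatar OR pair, together with a significance flag for each side.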
Both the raw and avatar datasets included 162,434 observations and 19 relevant variables. Of the 172 paired raw/avatar ORs that were computed, including analyses stratified by sex, more than 77% of the comparisons had an OR difference ≤0.5 and less than 7% had a discrepancy in the statistical significance of the associations, with a Cohen's kappa coefficient of 0.80.
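The three summary figures reported here can be sketched as follows, assuming that each pair carries a raw OR, an avatar OR, and a significant/non-significant flag for each, and that Cohen's kappa is computed on agreement between those two binary flags. The function name `compare_pairs` and the example pairs are hypothetical; they do not reproduce the study data.

```python
def compare_pairs(pairs, max_diff=0.5):
    """Summarise agreement between raw and avatar odds ratios.

    pairs: list of (raw_or, raw_sig, avatar_or, avatar_sig) tuples,
    where the *_sig entries are booleans.
    Returns (share with |OR difference| <= max_diff,
             share with discordant significance,
             Cohen's kappa on the significance flags).
    """
    n = len(pairs)
    close = sum(1 for raw_or, _, av_or, _ in pairs
                if abs(raw_or - av_or) <= max_diff)
    discordant = sum(1 for _, raw_sig, _, av_sig in pairs
                     if raw_sig != av_sig)

    # Observed agreement and agreement expected by chance.
    p_observed = 1 - discordant / n
    p_raw = sum(1 for _, raw_sig, _, _ in pairs if raw_sig) / n
    p_av = sum(1 for _, _, _, av_sig in pairs if av_sig) / n
    p_expected = p_raw * p_av + (1 - p_raw) * (1 - p_av)

    # Kappa is undefined when chance agreement is 1; report 1.0 in that case.
    kappa = 1.0 if p_expected == 1 else (p_observed - p_expected) / (1 - p_expected)
    return close / n, discordant / n, kappa


# Hypothetical pairs, for illustration only.
example = [
    (1.8, True, 1.6, True),
    (0.9, False, 1.0, False),
    (2.4, True, 1.7, False),
    (1.1, False, 1.2, False),
]
share_close, share_discordant, kappa = compare_pairs(example)
```

Kappa is used instead of raw agreement because it corrects for the agreement that two sets of significance flags would reach by chance alone.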
This study illustrates the flexibility and multiple uses of a synthetic database created with the avatar method in the particular field of occupational health. Such a database can be shared in open access without risking re-identification or privacy issues, and can help bring new insights into complex phenomena such as return to work.