Wheeler David C, Archer Kellie J, Burstyn Igor, Yu Kai, Stewart Patricia A, Colt Joanne S, Baris Dalsu, Karagas Margaret R, Schwenn Molly, Johnson Alison, Armenti Karla, Silverman Debra T, Friesen Melissa C
1.Department of Biostatistics, School of Medicine, Virginia Commonwealth University, 830 East Main Street, Richmond, VA 23298, USA
Ann Occup Hyg. 2015 Apr;59(3):324-35. doi: 10.1093/annhyg/meu098. Epub 2014 Nov 27.
OBJECTIVES: To evaluate occupational exposures in case-control studies, exposure assessors typically review each job individually to assign exposure estimates. This process lacks transparency and does not provide a mechanism for recreating the decision rules in other studies. In our previous work, nominal (unordered categorical) classification trees (CTs) generally successfully predicted expert-assessed ordinal exposure estimates (i.e. none, low, medium, high) derived from occupational questionnaire responses, but room for improvement remained. Our objective was to determine if using recently developed ordinal CTs would improve the performance of nominal trees in predicting ordinal occupational diesel exhaust exposure estimates in a case-control study.
METHODS: We used one nominal and four ordinal CT methods to predict expert-assessed probability, intensity, and frequency estimates of occupational diesel exhaust exposure (each categorized as none, low, medium, or high) derived from questionnaire responses for the 14983 jobs in the New England Bladder Cancer Study. To replicate the common use of a single tree, we applied each method to a single sample of 70% of the jobs, using 15% to test and 15% to validate each method. To characterize variability in performance, we conducted a resampling analysis that repeated the sample draws 100 times. We evaluated agreement between the tree predictions and expert estimates using Somers' d, which measures differences in terms of ordinal association between predicted and observed scores and can be interpreted similarly to a correlation coefficient.
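The agreement measure named in the methods can be illustrated with a small sketch. The abstract does not say which variable is treated as the reference or how ties are handled, so this version makes an assumption: it excludes pairs of jobs tied on the expert (observed) estimate and divides concordant minus discordant pairs by the remaining pair count. The function name and the category coding (0 = none through 3 = high) are illustrative, not taken from the study.

```python
from itertools import combinations


def somers_d(observed, predicted):
    """Somers' d of predicted scores with respect to observed scores.

    Assumption: pairs tied on the observed (expert) score are excluded
    from the denominator; pairs tied only on the predicted score stay in
    the denominator but count as neither concordant nor discordant.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")

    concordant = 0
    discordant = 0
    untied_on_observed = 0

    # Compare every pair of jobs once.
    for (o1, p1), (o2, p2) in combinations(zip(observed, predicted), 2):
        if o1 == o2:
            continue  # tied on the expert estimate: not informative here
        untied_on_observed += 1
        direction = (o1 - o2) * (p1 - p2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1

    if untied_on_observed == 0:
        raise ValueError("all observed scores are tied; Somers' d is undefined")
    return (concordant - discordant) / untied_on_observed


if __name__ == "__main__":
    # Hypothetical ordinal codes: 0 = none, 1 = low, 2 = medium, 3 = high.
    expert = [0, 0, 1, 2, 3, 3]
    tree = [0, 1, 1, 2, 2, 3]
    print(round(somers_d(expert, tree), 3))
```

Like a correlation coefficient, the value is 1 when the predictions order every untied pair of jobs the same way as the expert, and -1 when they reverse every such pair. The pairwise loop is quadratic in the number of jobs, which is fine for illustration but slow for a data set of roughly 15 000 jobs.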
RESULTS: From the resampling analysis, compared with the nominal tree, an ordinal CT method that used a quadratic misclassification function and controlled tree size based on total misclassification cost had a slightly better predictive performance that was statistically significant for the frequency metric (Somers' d: nominal tree = 0.61; ordinal tree = 0.63) and similar performance for the probability (nominal = 0.65; ordinal = 0.66) and intensity (nominal = 0.65; ordinal = 0.65) metrics. The best ordinal CT predicted fewer cases of large disagreement with the expert assessments (i.e. no exposure predicted for a job with high exposure and vice versa) compared with the nominal tree across all of the exposure metrics. For example, the percent of jobs with expert-assigned high intensity of exposure that the model predicted as no exposure was 29% for the nominal tree and 22% for the best ordinal tree.
CONCLUSIONS: The overall agreements were similar across CT models; however, the use of ordinal models reduced the magnitude of the discrepancy when disagreements occurred. As the best performing model can vary by situation, researchers should consider evaluating multiple CT methods to maximize the predictive performance within their data.
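One way to see why an ordinal tree with a quadratic misclassification function would produce fewer large disagreements is to compare its cost with a nominal 0/1 cost. The sketch below assumes the quadratic cost is the squared difference between equally spaced category scores (none = 0, low = 1, medium = 2, high = 3); the abstract does not give the exact scoring, and the function names are hypothetical.

```python
LEVELS = ["none", "low", "medium", "high"]  # assumed equally spaced scores 0..3


def quadratic_cost(true_level, predicted_level):
    """Squared distance between category scores (assumed form of the cost)."""
    return (LEVELS.index(true_level) - LEVELS.index(predicted_level)) ** 2


def nominal_cost(true_level, predicted_level):
    """0/1 cost: every misclassification is equally bad."""
    return 0 if true_level == predicted_level else 1


def total_cost(true_levels, predicted_levels, cost):
    """Total misclassification cost over a set of jobs."""
    return sum(cost(t, p) for t, p in zip(true_levels, predicted_levels))


if __name__ == "__main__":
    # Hypothetical expert estimates and two sets of tree predictions.
    expert = ["high", "high", "low", "none"]
    near_miss = ["medium", "medium", "low", "none"]  # two one-step errors
    far_miss = ["none", "high", "low", "none"]       # one three-step error

    for name, preds in [("near_miss", near_miss), ("far_miss", far_miss)]:
        print(name,
              "nominal:", total_cost(expert, preds, nominal_cost),
              "quadratic:", total_cost(expert, preds, quadratic_cost))
```

Under the nominal cost, predicting "none" for a high-exposure job is penalized no more than predicting "medium", whereas the quadratic cost penalizes it nine times as heavily. A tree grown or pruned against the quadratic total cost therefore has an incentive to avoid the high-versus-none errors described in the results.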