Quantitative Sciences, Flatiron Health, New York, NY.
Division of Health Policy and Management, College of Health Science, Korea University, Seoul, South Korea; Harvard Center for Population & Development Studies, Cambridge, MA.
Ann Epidemiol. 2020 Oct;50:7-14. doi: 10.1016/j.annepidem.2020.08.001. Epub 2020 Aug 12.
Epidemiologic studies often conflate the strength of association with predictive accuracy and build classification models based on arbitrarily selected probability cutoffs without considering the cost of misclassification. We illustrated these common pitfalls by building association, prediction, and classification models using birthweight as an exposure and child mortality and child anthropometric failure as outcomes.
Nationally representative samples of 188,819 and 164,113 children aged less than 5 years across India were used for our analysis of mortality and anthropometric failure, respectively. We assessed outcomes of neonatal, postneonatal, and child mortality as well as stunting, wasting, and underweight. Birthweight was the main exposure. We used adjusted and unadjusted logistic regression models to evaluate association strength, univariable and multivariable logistic regression models trained on 80% of the data using 10-fold cross-validation to evaluate predictive power, and classification models across a series of possible misclassification cost scenarios to evaluate classification accuracy.
Birthweight was strongly associated with five of six outcomes (P < .001), and associations were robust to covariate adjustment. Prediction models evaluated on the test set showed that birthweight was a poor discriminator of all outcomes (area under the curve < 0.62), and that adding birthweight to a multivariable model did not meaningfully improve discrimination. Prediction models for anthropometric failure showed substantially better calibration than prediction models for mortality. Depending on the ratio of false positive (FP) cost to false negative (FN) cost, the probability cutoff that minimized total misclassification cost ranged from 0.116 (cost ratio = 7:93) to 0.706 (cost ratio = 4:1), the corresponding true positive rate (TPR) ranged from 0.999 to 0.004, and the positive predictive value (PPV) ranged from 0.355 to 0.867.
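The cost-sensitive cutoff selection described above can be sketched as a simple grid search. This is a minimal illustration, not the authors' implementation: the function name `min_cost_cutoff`, the grid of 1,001 candidate cutoffs, and the toy labels and probabilities are all assumptions made here for demonstration.

```python
def min_cost_cutoff(y_true, p_hat, cost_fp, cost_fn, n_grid=1001):
    """Grid-search the probability cutoff with the lowest total misclassification cost.

    Hypothetical helper: y_true holds 0/1 labels, p_hat holds predicted
    probabilities, and cost_fp / cost_fn are the per-error costs.
    """
    best = None
    for i in range(n_grid):
        cutoff = i / (n_grid - 1)
        tp = fp = fn = 0
        for y, p in zip(y_true, p_hat):
            predicted_positive = p >= cutoff
            if predicted_positive and y == 1:
                tp += 1
            elif predicted_positive and y == 0:
                fp += 1
            elif not predicted_positive and y == 1:
                fn += 1
        cost = cost_fp * fp + cost_fn * fn
        # Keep the first (lowest) cutoff that attains the minimum cost.
        if best is None or cost < best["cost"]:
            best = {
                "cutoff": cutoff,
                "cost": cost,
                "tpr": tp / (tp + fn) if (tp + fn) else None,
                "ppv": tp / (tp + fp) if (tp + fp) else None,
            }
    return best


# Toy data (made up for illustration, not taken from the study).
labels = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

# Report performance across several FP:FN cost ratios rather than one cutoff.
for cost_fp, cost_fn in [(7, 93), (1, 1), (4, 1)]:
    print((cost_fp, cost_fn), min_cost_cutoff(labels, probs, cost_fp, cost_fn))
```

When FNs are expensive relative to FPs, the selected cutoff falls and TPR rises; when FPs are expensive, the cutoff rises and PPV improves at the expense of TPR. For perfectly calibrated probabilities the cost-minimizing cutoff would be cost_fp / (cost_fp + cost_fn), so an empirical search of this kind can land elsewhere when calibration is imperfect.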
Although birthweight is strongly associated with mortality and anthropometric failure, it is a poor predictor of child health outcomes, highlighting that strong associations do not imply predictive power. We recommend that (1) future research focus on building predictive models for anthropometric failure given their clinical relevance in diagnosing individual cases, and that (2) studies that build classifiers report performance metrics across a range of cutoffs to account for variation in the cost of FPs and FNs.
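The gap between association and prediction can be illustrated with a small calculation for a binary risk marker. This is a sketch under assumed numbers, not a reanalysis of the study: the exposure proportions (30% of cases and 12.5% of non-cases exposed) and both function names are hypothetical.

```python
def odds_ratio(p_exposed_cases, p_exposed_noncases):
    """Odds ratio comparing the odds of exposure in cases versus non-cases."""
    odds_cases = p_exposed_cases / (1 - p_exposed_cases)
    odds_noncases = p_exposed_noncases / (1 - p_exposed_noncases)
    return odds_cases / odds_noncases


def auc_binary_marker(p_exposed_cases, p_exposed_noncases):
    """AUC when a single binary marker is used as the classifier.

    The ROC curve then has one interior point (FPR, TPR), and the area
    under the two line segments equals (TPR + 1 - FPR) / 2.
    """
    tpr = p_exposed_cases      # sensitivity
    fpr = p_exposed_noncases   # 1 - specificity
    return (tpr + (1 - fpr)) / 2


# Hypothetical proportions chosen for illustration only.
p_cases, p_noncases = 0.30, 0.125
print("odds ratio:", round(odds_ratio(p_cases, p_noncases), 3))    # about 3.0
print("AUC:", round(auc_binary_marker(p_cases, p_noncases), 4))    # 0.5875
```

In this made-up scenario a threefold odds ratio, which would typically be reported as a strong association, corresponds to an AUC below 0.6, which is only modestly better than chance at separating cases from non-cases.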