Janssen Research and Development, Titusville, NJ, USA.
Erasmus MC, Rotterdam, the Netherlands.
J Biomed Inform. 2019 Sep;97:103264. doi: 10.1016/j.jbi.2019.103264. Epub 2019 Aug 3.
Smoking status is poorly recorded in US claims data. IBM MarketScan Commercial is a claims database that can be linked to an additional health risk assessment with self-reported smoking status for a subset of 1,966,174 patients. We investigate whether this subset could be used to learn a smoking status phenotype model, generalizable to all US claims data, that calculates the probability of being a current smoker.
Of these patients, 251,643 (12.8%) self-reported their smoking status as 'current smoker'. A regularized logistic regression model, the Current Risk of Smoking Status (CROSS), was trained using the subset of patients with self-reported smoking status. CROSS considered 53,027 candidate covariates, including demographics and conditions/drugs/measurements/procedures/observations recorded in the prior 365 days. The CROSS phenotype model was validated across multiple other claims databases.
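As an illustration of the kind of model described above, the following is a minimal sketch of a regularized logistic regression fitted to binary covariates. It is not the CROSS implementation: the L2 penalty, the gradient-descent fitting, the hyperparameters, the function names, and the three toy covariates are all assumptions made for the example (the abstract says only that the model is "regularized").

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_regularized_logistic(X, y, penalty=0.01, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss with an L2 penalty.

    The penalty type and all hyperparameters are illustrative assumptions.
    The intercept is left unpenalized.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * (gw / n + penalty * wj) for wj, gw in zip(w, grad_w)]
    return b, w


def predict_proba(b, w, x):
    # Predicted probability of the positive class ("current smoker").
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical toy data; columns stand for presence (1) or absence (0) in the
# prior 365 days of: [nicotine-replacement drug, COPD diagnosis, wellness visit].
X = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
     [0, 0, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = self-reported current smoker

b, w = fit_regularized_logistic(X, y)
p_high = predict_proba(b, w, [1, 1, 0])
p_low = predict_proba(b, w, [0, 0, 1])
```

With tens of thousands of candidate covariates, as in CROSS, the penalty is what keeps the model from overfitting sparse codes; a real analysis would use an established library rather than this hand-written loop.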
The internal validation showed that the CROSS model achieved an area under the receiver operating characteristic curve (AUC) of 0.76, and the calibration plots indicated it was well calibrated. The external validation across three US claims databases obtained AUCs ranging between 0.82 and 0.87, suggesting that the model is transportable across claims data.
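The two validation measures mentioned above, discrimination (AUC) and calibration, can be sketched as follows. This is a minimal illustration on made-up labels and scores, not the study's evaluation code; the function names and the equal-width binning are assumptions.

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def calibration_bins(labels, scores, n_bins=2):
    """Group predictions into equal-width probability bins and return
    (mean predicted probability, observed fraction, count) per non-empty bin.
    A well-calibrated model has the first two values close in every bin."""
    bins = [[] for _ in range(n_bins)]
    for l, s in zip(labels, scores):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, l))
    summary = []
    for members in bins:
        if members:
            mean_pred = sum(s for s, _ in members) / len(members)
            observed = sum(l for _, l in members) / len(members)
            summary.append((mean_pred, observed, len(members)))
    return summary


# Hypothetical labels (1 = current smoker) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

auc_value = auc(labels, scores)
calibration = calibration_bins(labels, scores)
```

The pairwise AUC is quadratic in the number of patients, which is fine for a sketch; at claims-database scale a rank-based formula or a library routine would be the practical choice.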
CROSS predicts current smoking status based on the claims records in the prior year. CROSS can be readily applied to any US insurance claims database mapped to the OMOP common data model and will be a useful way to impute smoking status when conducting epidemiology studies where smoking is a known confounder but smoking status is not recorded. CROSS is available from https://github.com/OHDSI/StudyProtocolSandbox/tree/master/SmokingModel.