Department of Community Health Sciences, University of Manitoba, Winnipeg, MB, Canada.
Department of Statistics, University of Manitoba, Winnipeg, MB, Canada.
BMC Med Inform Decis Mak. 2024 Feb 2;24(1):33. doi: 10.1186/s12911-024-02416-3.
Smoking is a risk factor for many chronic diseases. Multiple smoking status ascertainment algorithms have been developed for population-based electronic health databases such as administrative databases and electronic medical records (EMRs). Evidence syntheses of algorithm validation studies have often focused on chronic diseases rather than risk factors. We conducted a systematic review and meta-analysis of smoking status ascertainment algorithms to describe the characteristics and validity of these algorithms.
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines were followed. We searched EMBASE, MEDLINE, Scopus, and Web of Science for articles published from 1990 to 2022, using key terms such as validity, administrative data, electronic health records, smoking, and tobacco use. The extracted information, including article characteristics, algorithm characteristics, and validity measures, was analyzed descriptively. Sources of heterogeneity in validity measures were estimated using a meta-regression model. Risk of bias (ROB) in the reviewed articles was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 tool.
The initial search yielded 2086 articles; 57 were selected for review and 116 algorithms were identified. Almost three-quarters (71.6%) of algorithms were based on EMR data. The algorithms were primarily constructed using diagnosis codes for smoking-related conditions, although prescription medication codes for smoking treatments were also adopted. About half of the algorithms were developed using machine-learning models. The pooled estimates of positive predictive value, sensitivity, and specificity were 0.843, 0.672, and 0.918, respectively. Algorithm sensitivity and specificity were highly variable, ranging from 3% to 100% and from 36% to 100%, respectively. Model-based algorithms had significantly greater sensitivity than rule-based algorithms (p = 0.006). Algorithms for EMR data had higher sensitivity than algorithms for administrative data (p = 0.001). The ROB was low in most of the articles (76.3%) that underwent the assessment.
Multiple algorithms using different data sources and methods have been proposed to ascertain smoking status in electronic health data. Many algorithms had low sensitivity and positive predictive value, but the data source influenced their validity. Algorithms based on machine-learning models for multiple linked data sources have improved validity.
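For readers less familiar with the validity measures reported above, the following is a minimal Python sketch of how positive predictive value, sensitivity, and specificity are computed for a single algorithm from a 2x2 validation table. The `ConfusionCounts` class, the `validity_measures` function, and the example counts are hypothetical illustrations, not data or code from the reviewed studies, and the sketch does not reproduce the pooling or meta-regression used in the review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing an algorithm against a reference standard (hypothetical)."""
    tp: int  # algorithm says smoker, reference says smoker
    fp: int  # algorithm says smoker, reference says non-smoker
    fn: int  # algorithm says non-smoker, reference says smoker
    tn: int  # algorithm says non-smoker, reference says non-smoker


def validity_measures(c: ConfusionCounts) -> dict[str, float]:
    """Return PPV, sensitivity, and specificity; assumes all denominators are nonzero."""
    return {
        "ppv": c.tp / (c.tp + c.fp),          # share of flagged smokers who truly smoke
        "sensitivity": c.tp / (c.tp + c.fn),  # share of true smokers who are flagged
        "specificity": c.tn / (c.tn + c.fp),  # share of true non-smokers not flagged
    }


# Made-up counts for illustration only.
example = ConfusionCounts(tp=80, fp=20, fn=40, tn=360)
measures = validity_measures(example)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, sensitivity (80/120) is noticeably lower than specificity (360/380), which mirrors the general pattern in the pooled estimates: algorithms miss a substantial share of true smokers while rarely misclassifying non-smokers.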