Argante Lorenzo, Lonnet Germain, Aris Emmanuel, Whelan Jane
Clinical Statistics, GSK, Siena, Italy.
Real-World Analytics, GSK, Wavre, Belgium.
Digit Health. 2025 Apr 3;11:20552076251331895. doi: 10.1177/20552076251331895. eCollection 2025 Jan-Dec.
Gonorrhea is a sexually transmitted infection (STI) that, if untreated, can result in debilitating complications such as pelvic inflammatory disease, pain, and infertility. Only a minority of cases in the United States are diagnosed in STI clinics. Gonorrhea is often asymptomatic and is presumed to be substantially underdiagnosed, undertreated, or both.
To generate and compare predictive machine learning (ML) models using administrative claims data to characterize young women in the general United States population who would be most likely to contract gonorrhea.
Data were extracted from the Merative™ MarketScan Commercial and Medicaid databases containing routinely collected administrative claims data. Women aged 16-35 years with two years of continuous observation between 1 January 2017 and 31 December 2018 were included. ML classification models were constructed based on logistic regression and tree-based algorithms.
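The inclusion criteria above can be read as a simple eligibility filter. The following is a minimal sketch only, not the authors' extraction code: the function name, the argument names, the "F" sex code, and the use of a single enrollment span per person are all hypothetical, and the abstract does not say at which date age was assessed.

```python
from datetime import date

# Observation window stated in the abstract.
WINDOW_START = date(2017, 1, 1)
WINDOW_END = date(2018, 12, 31)


def is_eligible(sex, age, enrolled_from, enrolled_to):
    """Return True if a person meets the stated inclusion criteria.

    Assumptions (not from the source): sex is coded "F" for women, `age`
    is the age at whatever reference date the study used, and enrollment
    is one uninterrupted span from `enrolled_from` to `enrolled_to`.
    """
    covers_window = enrolled_from <= WINDOW_START and enrolled_to >= WINDOW_END
    return sex == "F" and 16 <= age <= 35 and covers_window
```

Real claims data usually store enrollment as several spans per person, so a production filter would first need to merge spans and check for gaps inside the window.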
Models constructed using tree-based algorithms such as XGBoost provided the best discrimination, but simpler ridge regression models with splines also achieved reasonable discrimination, allowing for the identification of population subsets at increased risk of gonorrhea infection. A subset of 0.1% of the population identified by the XGBoost model had a 70-fold higher risk of gonorrhea than the general population. External validation, in which the models were applied to a Medicaid dataset that was not used in developing them, was deemed acceptable.
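A "70-fold higher risk" in the top 0.1% is a lift statistic: the observed risk among the highest-scored fraction of the population divided by the risk in the whole population. The sketch below shows that arithmetic under stated assumptions. The function name is hypothetical, and the toy counts (10,000 people, 100 cases, 7 of them among the 10 highest scores) are invented purely so the arithmetic yields 70; they are not data from the study.

```python
def lift_at_top_fraction(scores, outcomes, fraction):
    """Risk in the top `fraction` of scores divided by the overall risk.

    `scores` are model-predicted risks and `outcomes` are 0/1 labels.
    Ties in score keep their input order (Python's sort is stable).
    """
    n = len(scores)
    if n == 0 or n != len(outcomes):
        raise ValueError("scores and outcomes must be non-empty and equal length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_cases = sum(outcomes)
    if total_cases == 0:
        raise ValueError("lift is undefined when there are no cases")
    k = max(1, round(n * fraction))  # size of the high-risk subset
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_cases = sum(outcome for _, outcome in ranked[:k])
    # (top_cases / k) / (total_cases / n), rearranged to a single division
    return (top_cases * n) / (k * total_cases)


# Invented toy population: scores strictly decrease with index.
n = 10_000
toy_scores = [n - i for i in range(n)]
toy_outcomes = [0] * n
for i in range(7):             # 7 cases among the 10 highest scores
    toy_outcomes[i] = 1
for i in range(5000, 5093):    # 93 further cases elsewhere, 100 in total
    toy_outcomes[i] = 1

toy_lift = lift_at_top_fraction(toy_scores, toy_outcomes, 0.001)
print(toy_lift)  # 70.0
```

In this toy case the top 0.1% has a risk of 7/10 against an overall risk of 100/10,000, which gives the 70-fold ratio.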
The models and methods presented here could facilitate the identification of women at high risk of contracting gonorrhea for whom targeted preventive measures may be most beneficial.