Baser Onur, Samayoa Gabriela, Yapar Nehir, Baser Erdem
Graduate School of Public Health, City University of New York, New York, NY, USA.
University of Michigan Medical School, Ann Arbor, Michigan, USA.
J Health Econ Outcomes Res. 2024 Sep 25;11(2):86-94. doi: 10.36469/001c.123645. eCollection 2024.
Although its prevalence is increasing, nonalcoholic steatohepatitis (NASH) often goes undiagnosed in clinical practice. This study used a machine learning algorithm to identify patients in the Veterans Affairs (VA) health system who likely had undiagnosed NASH. From a VA data set of 25 million adult enrollees, the study population was divided into NASH-positive, non-NASH, and at-risk cohorts, and a claims data analysis was performed. To build the model, the study population was randomly divided into an 80% training subset and a 20% testing subset, and models were trained and tested using cross-validation. In addition to the baseline model, gradient-boosted classification tree, naïve Bayes, and random forest models were created and compared on receiver operating characteristic (ROC) curves, area under the curve (AUC), and accuracy. The best-performing model was retrained on the full 80% training subset and applied to the 20% testing subset to calculate the performance metrics. In total, 4 223 443 patients met the study inclusion criteria, of whom 4903 were NASH-positive and 35 528 were non-NASH. The remaining patients formed the at-risk cohort, of whom 514 997 (12%) were identified as likely to have NASH. Age, obesity, and abnormal liver function tests were the top determinants in assigning NASH probability. Using machine learning to predict NASH could support wider recognition, timely intervention, and targeted treatment to slow or mitigate disease progression, and could serve as an initial screening tool.
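The evaluation protocol described in the abstract (80/20 split, cross-validated comparison of candidate models by AUC, retraining the best model on the full training subset, then scoring the held-out 20%) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' code: the data are synthetic, the two toy scorers (`fit_centroid`, `fit_second_feature`) merely stand in for the gradient-boosted tree, naïve Bayes, and random forest models, the fold count of 5 is arbitrary, and all function names are hypothetical.

```python
import random

def split_80_20(n, seed=0):
    """Shuffle row indices and return (training, testing) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = n * 8 // 10
    return idx[:cut], idx[cut:]

def k_folds(indices, k=5):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, held_out

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.0):
    """Share of rows where (score >= threshold) matches the label."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)

def fit_centroid(X, y):
    """Toy linear scorer: project onto the difference of class means."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    mu1 = mean([x for x, t in zip(X, y) if t == 1])
    mu0 = mean([x for x, t in zip(X, y) if t == 0])
    w = [a - b for a, b in zip(mu1, mu0)]
    mid = [(a + b) / 2 for a, b in zip(mu1, mu0)]
    return lambda x: sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mid))

def fit_second_feature(X, y):
    """Toy weak scorer: uses only the second (uninformative) column."""
    m = sum(x[1] for x in X) / len(X)
    return lambda x: x[1] - m

def select_and_evaluate(X, y, candidates, k=5, seed=0):
    """Pick the candidate with the best mean CV AUC, refit, score the test set."""
    train, test = split_80_20(len(y), seed)
    cv_auc = {}
    for name, fit in candidates.items():
        aucs = []
        for fit_idx, val_idx in k_folds(train, k):
            scorer = fit([X[i] for i in fit_idx], [y[i] for i in fit_idx])
            aucs.append(roc_auc([y[i] for i in val_idx],
                                [scorer(X[i]) for i in val_idx]))
        cv_auc[name] = sum(aucs) / k
    best = max(cv_auc, key=cv_auc.get)
    # Retrain the winner on the full training subset before touching test rows.
    scorer = candidates[best]([X[i] for i in train], [y[i] for i in train])
    y_test = [y[i] for i in test]
    s_test = [scorer(X[i]) for i in test]
    metrics = {"auc": roc_auc(y_test, s_test),
               "accuracy": accuracy(y_test, s_test)}
    return best, cv_auc, metrics

# Synthetic stand-in data: column 0 is informative, column 1 is noise.
rng = random.Random(42)
y = [1 if rng.random() < 0.3 else 0 for _ in range(1000)]
X = [[rng.gauss(1.5 * t, 1.0), rng.gauss(0.0, 1.0)] for t in y]

candidates = {"centroid": fit_centroid, "second_feature": fit_second_feature}
best, cv_auc, metrics = select_and_evaluate(X, y, candidates)
print(best, cv_auc, metrics)
```

The test rows are used only once, after model selection is finished, so the reported metrics are not biased by the comparison among candidates. In practice a library such as scikit-learn would replace the hand-written split, fold, and metric helpers.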