Liu Dianbo, Choi Karmel W, Lizano Paulo, Yuan William, Yu Kun-Hsing, Smoller Jordan, Kohane Isaac
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; School of Medicine, National University of Singapore, Singapore; College of Design and Engineering, National University of Singapore, Singapore.
Center for Human Genetics Research, Massachusetts General Hospital, Boston, MA, USA.
Schizophr Res. 2025 Sep;283:59-66. doi: 10.1016/j.schres.2025.06.024. Epub 2025 Jul 7.
The prevalence of severe mental illnesses (SMIs) in the United States is approximately 3% of the total population. The ability to screen for SMI risk at large scale could inform early prevention and treatment.
A scalable machine learning-based tool was developed to conduct population-level risk screening for SMIs, including schizophrenia, schizoaffective disorders, psychosis, and bipolar disorders, using 1) healthcare insurance claims and 2) electronic health records (EHRs).
DESIGN, SETTING AND PARTICIPANTS: Data were drawn from beneficiaries of a nationwide commercial healthcare insurer with 77.4 million members and from the EHRs of patients at eight academic hospitals in the U.S. First, predictive models were constructed and tested on case-control cohorts from insurance claims or EHR data. Second, the performance of the predictive models across data sources was analyzed. Third, as an illustrative application, the models were further trained to predict the risk of SMIs among 18-year-old young adults and among individuals with substance-associated conditions.
Machine learning-based predictive models for SMIs in the general population were built from insurance claims and EHR data.
A total of 301,221 patients with SMIs and 2,439,890 control individuals were retrieved from the nationwide health insurance claims database in the U.S. A total of 59,319 patients with SMIs and 297,993 control individuals were retrieved from the EHRs of eight hospitals in a major integrated healthcare system in Massachusetts, U.S. On an independent test set of an all-age case-control cohort, the resulting predictive models for SMIs achieved an AUCROC of 0.76, specificity of 79.1%, and sensitivity of 61.9% using insurance claims data, and an AUCROC of 0.83, specificity of 85.1%, and sensitivity of 66.4% using EHR data. The models fine-tuned for specific use cases outperformed two rule-based benchmark methods when predicting the 12-month risk of SMIs among 18-year-old young adults, but performed worse than the benchmark methods when predicting SMIs among individuals with substance-associated conditions in claims data.
The performance of our SMI prediction models, constructed using health insurance claims or EHR data, suggests that real-world healthcare data can feasibly be used for large-scale screening of SMIs in the general population. In addition, our analysis showed that machine learning models trained on real-world healthcare data generalize across data sources: models constructed from insurance claims appear to be transferable to EHR cohorts and vice versa.
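For readers less familiar with the reported metrics, the following is a minimal sketch of how sensitivity, specificity, and the area under the ROC curve can be computed for a binary case-control classifier. It is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made purely for illustration and do not come from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, where 1 = case, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison: the probability that a
    randomly chosen case is scored higher than a randomly chosen control
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the study): 3 cases and 4 controls.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# Assumed decision threshold of 0.5 for turning risk scores into predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = auroc(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUROC={auc:.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the area under the ROC curve summarizes ranking quality across all thresholds, which is why abstracts such as this one typically report all three.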