Wang Bo, Sheu Yi-Han, Lee Hyunjoon, Mealer Robert G, Castro Victor M, Smoller Jordan W
Center for Precision Psychiatry, Massachusetts General Hospital, Boston, MA, USA.
Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.
J Child Psychol Psychiatry. 2025 Aug;66(8):1141-1154. doi: 10.1111/jcpp.14131. Epub 2025 Feb 18.
Early identification of bipolar disorder (BD) provides an important opportunity for timely intervention. In this study, we aimed to develop machine learning models using large-scale electronic health record (EHR) data including clinical notes for predicting early-onset BD.
Structured and unstructured data were extracted from the longitudinal EHR of the Mass General Brigham health system. We defined three cohorts aged 10-25 years: (1) the full youth cohort (N = 300,398); (2) a subcohort defined by having a mental health visit (N = 105,461); and (3) a subcohort defined by having a diagnosis of mood disorder or ADHD (N = 35,213). By adopting a prospective landmark modeling approach that aligns with clinical practice, we developed and validated a range of machine learning models, across different cohorts and prediction windows.
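To make the landmark idea concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a landmark date at which a prediction is made, uses only information recorded on or before that date as features, and labels a patient positive if a first BD diagnosis falls within the prediction window after the landmark. The `Patient` record, the `landmark_example` function, and the two toy features are hypothetical names introduced for illustration. The exclusion rules (a prior BD diagnosis, or no prior visits) are assumptions as well.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass
class Patient:
    # Hypothetical minimal record; a real EHR extract would be far richer.
    patient_id: str
    birth_date: date
    visit_dates: List[date]
    first_bd_diagnosis: Optional[date]  # None if BD was never diagnosed


def landmark_example(
    patient: Patient,
    landmark: date,
    window_days: int,
    min_age: float = 10.0,
    max_age: float = 25.0,
) -> Optional[Tuple[Dict[str, float], int]]:
    """Return (features, label) for one patient at one landmark, or None if ineligible."""
    age = (landmark - patient.birth_date).days / 365.25
    if not (min_age <= age <= max_age):
        return None  # outside the 10-25 year age range
    dx = patient.first_bd_diagnosis
    if dx is not None and dx <= landmark:
        return None  # assumption: outcome already observed, so nothing to predict
    prior_visits = [d for d in patient.visit_dates if d <= landmark]
    if not prior_visits:
        return None  # assumption: require some history before the landmark
    # Features use only data available at the landmark (no look-ahead).
    features = {
        "age_at_landmark": age,
        "n_prior_visits": float(len(prior_visits)),
    }
    # Label: first BD diagnosis strictly after the landmark, within the window.
    horizon = landmark + timedelta(days=window_days)
    label = int(dx is not None and landmark < dx <= horizon)
    return features, label


example_patient = Patient(
    patient_id="p1",
    birth_date=date(2000, 1, 1),
    visit_dates=[date(2015, 3, 1), date(2016, 6, 1)],
    first_bd_diagnosis=date(2017, 1, 15),
)
one_year = landmark_example(example_patient, date(2016, 7, 1), window_days=365)
ninety_days = landmark_example(example_patient, date(2016, 7, 1), window_days=90)
```

In this sketch the same patient is a positive example for a one-year window and a negative example for a 90-day window, which is why performance is reported separately for each prediction window.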
The two tree-based models, random forests (RF) and light gradient-boosting machine (LGBM), achieved good discriminative performance across different clinical settings (area under the receiver operating characteristic curve 0.76-0.88 for RF and 0.74-0.89 for LGBM). In addition, comparable performance was achieved with a greatly reduced set of features, demonstrating that computational efficiency can be attained without significant compromise of model accuracy.
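For readers less familiar with the metric, the area under the receiver operating characteristic curve can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen non-case, with ties counted as one half. The sketch below computes it directly from that definition. The function name `auroc` and the toy labels and scores are illustrative assumptions, not data from the study, and the quadratic pairwise loop is meant for clarity, not for cohorts of this size.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (case, non-case) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # case scored above non-case
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 2 cases and 2 non-cases, with one mis-ranked pair out of four.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so the reported range of roughly 0.74 to 0.89 sits well above chance.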
Good discriminative performance for models predicting early-onset BD can be achieved utilizing large-scale EHR data. Our study offers a scalable and accurate method for identifying youth at risk for BD that could help inform clinical decision-making and facilitate early intervention. Future work includes evaluating the portability of our approach to other healthcare systems and exploring considerations regarding possible implementation.