Chen Zhenghua, Lasserre Patricia, Lin Angela, Rajapakshe Rasika
BC Cancer Kelowna, Kelowna, Canada.
Computer Science, University of British Columbia-Okanagan, Kelowna, Canada.
JCO Clin Cancer Inform. 2025 Jul;9:e2400317. doi: 10.1200/CCI-24-00317. Epub 2025 Jul 23.
Social determinants of health (SDoH) have a significant effect on health outcomes and inequalities. SDoH can be extracted from electronic health records (EHRs) to support policy development and research aimed at improving population health. Automated extraction using artificial intelligence (AI) can improve efficiency and cost-effectiveness. The focus of this study was to automatically extract comprehensive SDoH details from EHRs using a natural language processing (NLP)-based AI pipeline.
A curated set of 1,000 BC Cancer clinical documents with concentrated SDoH information served as the reference standard for training and evaluating NLP models. Two pipelines were used: an open-source pipeline trained on the annotated medical documents and an industrial pretrained solution used as a benchmark. Three experiments were run to optimize the first pipeline's performance, assessing the effect of including subtype word positions during training. The better-performing open-source pipeline was then used to extract SDoH information from 13,258 oncology documents.
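To make the idea of training with word positions concrete, the following is a minimal sketch of one common way position-annotated text is prepared for a sequence-labeling model: character-offset spans are converted to token-level BIO tags. The abstract does not describe the annotation format or the models used, so the `Span` class, the label names, the regex tokenizer, and the sample note are all hypothetical illustrations rather than the study's method.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # hypothetical label name, e.g. "TOBACCO_USE"


def span_of(text, phrase, label):
    """Build a Span from the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


def bio_tags(text, spans):
    """Return (token, tag) pairs using a simple word/punctuation tokenizer."""
    tagged = []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for s in spans:
            # A token belongs to a span if it lies entirely inside it.
            if m.start() >= s.start and m.end() <= s.end:
                prefix = "B-" if m.start() == s.start else "I-"
                tag = prefix + s.label
                break
        tagged.append((m.group(), tag))
    return tagged


# Invented example sentence, not taken from the BC Cancer data.
note = "Patient is a former smoker and lives alone."
spans = [
    span_of(note, "former smoker", "TOBACCO_USE"),
    span_of(note, "lives alone", "LIVING_STATUS"),
]
tags = bio_tags(note, spans)
```

Dropping the position information would reduce each annotation to a document-level label, which is the kind of contrast an experiment on word positions could examine.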
The open-source pipeline achieved an average F1 score of 0.88 on the validation data set for extracting 13 SDoH factors, surpassing the benchmark by 5%. It excelled in detailed subtype extraction, whereas the benchmark performed better at identifying SDoH information that was rarely annotated in the BC Cancer data set. Overall, 60,717 SDoH factors and associated details were extracted from BC Cancer EHR oncology documents. The most frequently extracted SDoH factors were tobacco use, employment status, marital status, alcohol consumption, and living status, each occurring between 8,000 and 12,000 times.
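As an illustration of how an average F1 score across several SDoH factors can be computed, here is a minimal sketch. The abstract does not state whether the reported average is macro- or micro-averaged; this example assumes an unweighted (macro) average. The factor names and the true-positive, false-positive, and false-negative counts are invented for demonstration and are not results from the study.

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Return (unweighted mean F1, per-factor F1) from {factor: (tp, fp, fn)}."""
    scores = {factor: f1(*c) for factor, c in counts.items()}
    return sum(scores.values()) / len(scores), scores


# Hypothetical counts, for illustration only.
counts = {
    "tobacco_use": (90, 10, 10),        # F1 = 180 / 200
    "employment_status": (40, 10, 10),  # F1 = 80 / 100
    "housing": (3, 1, 5),               # F1 = 6 / 12
}
average, per_factor = macro_f1(counts)
```

Under macro averaging, a rarely annotated factor such as the hypothetical `housing` entry pulls the average down as much as a frequent one, which is one reason per-factor scores are worth reporting alongside the mean.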
This study demonstrates the potential of an NLP pipeline to extract SDoH factors from clinical notes, with strong performance on limited data, although data set-specific adjustments are needed for broader application across institutions.