Yao Zonghai, Zhao Youxia, Mitra Avijit, Levy David A, Druhl Emily, Tsai Jack, Yu Hong
medRxiv. 2025 Jul 14:2025.07.10.25331237. doi: 10.1101/2025.07.10.25331237.
Eviction is a significant yet understudied social determinant of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human-validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes. The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.
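The Macro-F1 figures reported above are unweighted averages of per-class F1 scores, so rare categories count as much as frequent ones. The following is a minimal sketch of that standard metric, not the authors' evaluation code; the function name `macro_f1` and the label strings in the usage note are hypothetical and are not taken from the dataset's 14 categories.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[g] += 1   # true class g was missed

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with the hypothetical labels `gold = ["eviction", "none", "eviction", "pending"]` and `pred = ["eviction", "none", "none", "pending"]`, the per-class F1 scores are 2/3, 2/3, and 1, giving a Macro-F1 of 7/9.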