Medina Florez Juan Andres, Raza Shaina, Lynn Ansell Rashida, Shakeri Zahra, Smith Brendan T, Dolatabadi Elham
Dalla Lana School of Public Health, University of Toronto, Toronto, Canada.
Vector Institute, Toronto, Canada.
PLoS One. 2025 Jul 2;20(7):e0326668. doi: 10.1371/journal.pone.0326668. eCollection 2025.
Understanding disparities in the prevalence of Post COVID-19 Condition (PCC) amongst vulnerable populations is crucial to improving care and addressing intersecting inequities. This study aims to develop a comprehensive framework for integrating social determinants of health (SDOH) into PCC research by leveraging natural language processing (NLP) techniques to analyze disparities and variations in SDOH representation within PCC case reports. Following construction of a PCC Case Report Corpus, comprising over 7,000 case reports from the LitCOVID repository, a subset of 709 reports was annotated with 26 core SDOH-related entity types using pre-trained named entity recognition (NER) models, human review, and data augmentation to improve the quality, diversity, and representation of entity types. An NLP pipeline integrating NER, natural language inference (NLI), trigram, and frequency analyses was developed to extract and analyze these entities. Both encoder-only transformer models and RNN-based models were assessed for the NER objective. Fine-tuned encoder-only BERT models generalized better than traditional RNN-based models to distinct sentence structures and to greater class sparsity, achieving a macro F1-score of 0.72 and macro AUC of 0.99 on a held-out generalization set. Exploratory analysis revealed variability in entity richness: entities such as condition, age, and access to care were prevalent, whereas sensitive categories such as race and housing status were under-represented. Trigram analysis highlighted frequent co-occurrences among entities, including age, gender, and condition. The NLI objective (entailment and contradiction analysis) showed that attributes like "Experienced violence or abuse" and "Has medical insurance" had high entailment rates (82.4%-80.3%), while attributes such as "Is female-identifying," "Is married," and "Has a terminal condition" exhibited high contradiction rates (70.8%-98.5%).
Our results highlight the effectiveness of transformer-based NER in extracting SDOH information from case reports. However, the findings also expose critical gaps in the representation of marginalized groups within PCC-related academic case reports, e.g., across gender, insurance status, and age. This work underscores the need for standardized SDOH documentation and inclusive reporting practices to enable more equitable research and inform future health policy and AI model development.
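The macro F1-score reported above weights every entity class equally, which is why it is informative when some classes are sparse. The following is a minimal sketch of that metric, assuming token-level labels; the label names and toy data are hypothetical illustrations and are not taken from the study's corpus or evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Return (macro F1, per-class F1) for two equal-length label sequences."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative
    per_class = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class[label] = 2 * tp[label] / denom if denom else 0.0
    # Unweighted mean: a rare class counts as much as a frequent one.
    return sum(per_class.values()) / len(per_class), per_class


# Hypothetical toy labels: the single rare HOUSING token is mislabeled.
gold = ["AGE"] * 4 + ["CONDITION"] * 3 + ["HOUSING"]
pred = ["AGE"] * 4 + ["CONDITION"] * 3 + ["CONDITION"]

macro, per_class = macro_f1(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy={accuracy:.3f} macro_f1={macro:.3f}")
```

In this toy case seven of eight tokens are correct, yet the macro average drops sharply because the one under-represented class is missed entirely, which illustrates why a macro score is a stricter measure under class sparsity.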