School of Nursing, Columbia University, 560 West 168th St, Mail Code 6, New York, NY, 10032, USA.
Department of Computer Science, Aalto University, Espoo, Finland.
Matern Child Health J. 2024 Mar;28(3):578-586. doi: 10.1007/s10995-023-03857-4. Epub 2023 Dec 26.
Stigma and bias related to race and other minoritized statuses may underlie disparities in pregnancy and birth outcomes. One emerging method for identifying bias is the study of stigmatizing language in the electronic health record. The objective of our study was to develop natural language processing (NLP) methods that automatically and accurately identify two types of stigmatizing language in labor and birth notes: marginalizing language and its complement, power/privilege language.
We analyzed notes for all birthing people at more than 20 weeks' gestation who were admitted for labor and birth at two hospitals during 2017. We preprocessed the note text, using TF-IDF values as model inputs, and tested machine learning classification algorithms to identify marginalizing and power/privilege language in clinical notes. The algorithms assessed were Decision Trees, Random Forest, and Support Vector Machines. We also applied a feature importance evaluation method (InfoGain) to identify words that are highly correlated with these language categories.
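To make the two feature computations named above concrete, the following is a minimal sketch, not the study's implementation. It assumes whitespace tokenization, one common TF-IDF variant (term frequency times the natural log of N over document frequency), and information gain computed on word presence versus absence. The function names, the toy notes, and the labels are hypothetical and were invented for illustration; they are not data from the study.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document.

    Assumed variant: (count / doc length) * ln(N / document frequency).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    return [
        {w: (c / len(toks)) * math.log(n_docs / doc_freq[w])
         for w, c in Counter(toks).items()}
        for toks in tokenized
    ]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs in a note."""
    present = [y for d, y in zip(docs, labels) if word in d.lower().split()]
    absent = [y for d, y in zip(docs, labels) if word not in d.lower().split()]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Hypothetical toy notes; label 1 marks a note flagged for the target category.
toy_notes = [
    "patient refused epidural",
    "patient refused exam",
    "patient resting comfortably",
    "patient tolerated exam",
]
toy_labels = [1, 1, 0, 0]

weights = tfidf(toy_notes)
gain_refused = information_gain(toy_notes, toy_labels, "refused")
gain_exam = information_gain(toy_notes, toy_labels, "exam")
```

In this toy set, a word that appears in every note receives a TF-IDF weight of zero, and a word that splits the labels perfectly receives the maximum information gain for a balanced binary label (1 bit). In practice the TF-IDF vectors would be passed to a classifier such as a decision tree or support vector machine.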
For marginalizing language, Decision Trees yielded the best classification, with an F-score of 0.73. For power/privilege language, Support Vector Machines performed best, with an F-score of 0.91. These results indicate that the selected machine learning methods can classify these language categories in clinical notes.
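For readers unfamiliar with the metric, the sketch below shows how an F-score can be computed from binary predictions. It assumes the balanced F1 (the harmonic mean of precision and recall) for the positive class; the abstract does not state which F-score variant or averaging scheme was used. The function name and the example labels are hypothetical.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Balanced F1 for the positive class: 2 * P * R / (P + R)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and classifier output for five notes.
example_true = [1, 1, 1, 0, 0]
example_pred = [1, 1, 0, 1, 0]
example_f1 = f1_score_binary(example_true, example_pred)
```

Here two of the three predicted positives are correct and two of the three true positives are found, so precision and recall are both 2/3 and the F1 is also 2/3.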
We identified well-performing machine learning methods for automatically detecting stigmatizing language in clinical notes. To our knowledge, this is the first study to use NLP performance metrics to evaluate how well machine learning methods identify stigmatizing language. Future studies should further refine and evaluate NLP methods, including recent algorithms based on deep learning.