Jiang Zifan, Seyedi Salman, Griner Emily, Abbasi Ahmed, Rad Ali Bahrami, Kwon Hyeokhyen, Cotes Robert O, Clifford Gari D
Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, Georgia, United States of America.
Department of Biomedical Engineering, Emory University and Georgia Institute of Technology, Atlanta, Georgia, United States of America.
PLOS Digit Health. 2024 Jul 24;3(7):e0000413. doi: 10.1371/journal.pdig.0000413. eCollection 2024 Jul.
Research on automated mental health assessment tools has grown in recent years, often aiming to address the subjectivity and bias that exist in current clinical practice of psychiatric evaluation. Despite the substantial health and economic ramifications, the potential unfairness of these automated tools remains understudied and requires more attention. In this work, we systematically evaluated the fairness of a multimodal remote mental health dataset and assessment system, comparing fairness across race, gender, education level, and age. We compared the demographic parity ratio (DPR) and equalized odds ratio (EOR) of classifiers using different modalities, along with F1 scores across demographic groups. Post-training classifier threshold optimization was employed to mitigate unfairness. No statistically significant unfairness was found in the composition of the dataset. Varying degrees of unfairness were identified among modalities, with no single modality consistently demonstrating better fairness across all demographic variables. Post-training mitigation effectively improved both DPR and EOR at the expense of a decrease in F1 scores. Addressing and mitigating unfairness in these automated tools are essential steps in fostering trust among clinicians, gaining deeper insights into their use cases, and facilitating their appropriate utilization.
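The paper's evaluation pipeline is not reproduced here, but the two fairness metrics it reports, DPR and EOR, together with per-group F1 scores, can be computed with the open-source fairlearn library. The sketch below uses randomly generated labels, predictions, and a hypothetical sensitive attribute purely as placeholders; it is not the authors' code.

# Minimal sketch of the fairness metrics named in the abstract (DPR, EOR,
# per-group F1), computed with the open-source fairlearn library.
# All data below is random placeholder data, not the study's dataset.
import numpy as np
from sklearn.metrics import f1_score
from fairlearn.metrics import (
    MetricFrame,
    demographic_parity_ratio,
    equalized_odds_ratio,
)

rng = np.random.default_rng(0)
n = 500
y_true = rng.integers(0, 2, size=n)          # binary diagnostic labels
y_pred = rng.integers(0, 2, size=n)          # classifier decisions
gender = rng.choice(["female", "male"], n)   # hypothetical sensitive attribute

# Demographic parity ratio: min/max of positive-prediction rates across
# groups; 1.0 indicates perfect parity.
dpr = demographic_parity_ratio(y_true, y_pred, sensitive_features=gender)

# Equalized odds ratio: min/max of true- and false-positive rates across
# groups; 1.0 indicates identical error profiles.
eor = equalized_odds_ratio(y_true, y_pred, sensitive_features=gender)

# F1 score broken down by demographic group.
f1_by_group = MetricFrame(
    metrics=f1_score, y_true=y_true, y_pred=y_pred, sensitive_features=gender
).by_group

print(f"DPR={dpr:.3f}  EOR={eor:.3f}")
print(f1_by_group)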
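The post-training mitigation step described in the abstract is likewise available off the shelf: fairlearn's ThresholdOptimizer learns group-specific decision thresholds on top of a fitted classifier. The following is a sketch on placeholder data with an assumed logistic-regression base model, not necessarily the configuration used in the study.

# Sketch of post-training threshold optimization with fairlearn's
# ThresholdOptimizer; placeholder data and an assumed base classifier,
# not the study's actual setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.postprocessing import ThresholdOptimizer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                # placeholder features
y = rng.integers(0, 2, size=500)             # placeholder labels
group = rng.choice(["A", "B"], size=500)     # hypothetical demographic group

base = LogisticRegression().fit(X, y)

# Learn per-group thresholds on the classifier's scores so that the
# equalized-odds constraint is (approximately) satisfied.
mitigator = ThresholdOptimizer(
    estimator=base,
    constraints="equalized_odds",     # "demographic_parity" is also supported
    objective="balanced_accuracy_score",
    prefit=True,
    predict_method="predict_proba",
)
mitigator.fit(X, y, sensitive_features=group)

y_fair = mitigator.predict(X, sensitive_features=group)

Because this procedure constrains the operating point per group, it typically trades some predictive performance for fairness, consistent with the drop in F1 scores that the abstract reports after mitigation.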