Willem Theresa, Wollek Alessandro, Cheslerean-Boghiu Theodor, Kenney Martha, Buyx Alena
Institute of History and Ethics in Medicine, School of Medicine and Health, Technical University of Munich, Munich, Germany.
Helmholtz AI, Helmholtz Munich, Munich, Germany.
JMIR Med Inform. 2025 Jan 28;13:e59452. doi: 10.2196/59452.
In data-sparse areas such as health care, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models' outputs. As standard practice, categorical data, such as patients' gender, socioeconomic status, or skin color, are fused with other data types, such as medical images and text-based medical information, to train models. However, the effects of including categorical data features for model training in such data-scarce areas are underexamined, particularly regarding models intended to serve individuals equitably in a diverse population.
This study aimed to explore the effects of categorical data on machine learning model outputs, to trace these effects to the data collection and dataset publication processes, and to propose a mixed methods approach to examining a dataset's data categories before using them for machine learning training.
Against the theoretical background of the social construction of categories, we suggest a mixed methods approach to assess the utility of categorical data for machine learning model training. As an example, we applied our approach to a Brazilian dermatological dataset (Dermatological and Surgical Assistance Program at the Federal University of Espírito Santo [PAD-UFES] 20). We first present an exploratory, quantitative study that assesses the effects of including or excluding each of the unique categorical data features of the PAD-UFES 20 dataset when training a transformer-based model using a data fusion algorithm. We then pair our quantitative analysis with a qualitative examination of the data categories based on interviews with the dataset authors.
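The include-or-exclude comparison described above can be pictured as a set of training runs that differ only in which categorical features are passed to the model. The following is a minimal sketch of how such runs might be enumerated, not the authors' implementation: the function name and the column names are hypothetical placeholders and are not taken from the PAD-UFES 20 schema, and model training itself is omitted.

```python
from typing import Dict, List


def leave_one_out_feature_sets(features: List[str]) -> Dict[str, List[str]]:
    """Map a run name to the categorical features used in that run.

    Produces a run with all features, a run with none, and one run per
    feature in which only that feature is excluded.
    """
    runs: Dict[str, List[str]] = {
        "all_features": list(features),
        "no_categorical": [],
    }
    for excluded in features:
        runs[f"without_{excluded}"] = [f for f in features if f != excluded]
    return runs


# Hypothetical column names, used only for illustration.
CATEGORICAL_FEATURES = ["gender", "skin_type", "region"]

runs = leave_one_out_feature_sets(CATEGORICAL_FEATURES)
for run_name, kept in runs.items():
    # Each entry would correspond to one training run of the fusion model.
    print(run_name, kept)
```

Comparing per-class metrics of each `without_*` run against the `all_features` run would then indicate how a single categorical feature shifts the model's outputs for each predictive class.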
Our quantitative study suggests that including categorical data for machine learning model training has scattered effects across predictive classes. Our qualitative analysis gives insights into how the categorical data were collected and why they were published, explaining some of the quantitative effects that we observed. Our findings highlight the social constructedness of categorical data in publicly available datasets, meaning that the data in a category heavily depend on both how the dataset creators define the category and the sociomedical context in which the data are collected. This reveals relevant limitations of using publicly available datasets in contexts different from those in which their data were collected.
We caution against using data features of publicly available datasets without reflection on the social construction and context dependency of their categorical data features, particularly in data-sparse areas. We conclude that social scientific, context-dependent analysis of available data features using both quantitative and qualitative methods is helpful in judging the utility of categorical data for the population for which a model is intended.