Department of Computer Science, City University of Hong Kong, Hong Kong SAR.
School of Artificial Intelligence, Jilin University, Jilin, China.
Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac078.
Healthcare disparities in multiethnic medical data are a major challenge; the main cause is the unequal representation of ethnic groups within data cohorts. Biomedical data collected by cancer genome research projects may consist mainly of one ethnic group, such as people with European ancestry. In contrast, other ethnic groups, such as African, Asian, Hispanic, and Native American populations, can be far less represented. Data inequality in the biomedical field is an important research problem: it leads machine learning models to perform unevenly across groups and thereby creates healthcare disparities. Previous research has reduced these disparities using only a limited set of data distributions. In our study, we fine-tune deep learning and transfer learning models on different multiethnic data distributions for the prognosis of 33 cancer types. Previous studies that aimed to reduce healthcare disparities used only a single ethnic cohort as the target domain, paired with one major source domain. In contrast, we use multiple ethnic cohorts as the target domain in transfer learning, drawing on the TCGA and MMRF CoMMpass study datasets. In performance comparisons on both the old and the new data distribution experiments, our proposed model shows promising performance for transfer learning schemes compared with the baseline approach.