Conev Anja, Fasoulis Romanos, Hall-Swan Sarah, Ferreira Rodrigo, Kavraki Lydia E
Department of Computer Science, Rice University, Houston, TX, USA.
iScience. 2023 Dec 2;27(1):108613. doi: 10.1016/j.isci.2023.108613. eCollection 2024 Jan 19.
Peptide-HLA (pHLA) binding prediction is essential in screening peptide candidates for personalized peptide vaccines. Machine learning (ML) pHLA binding prediction tools are trained on vast amounts of data and are effective in screening peptide candidates. Most ML models report the ability to generalize to HLA alleles unseen during training ("pan-allele" models). However, the use of datasets with imbalanced allele content raises concerns about biased model performance. First, we examine the data bias of two ML-based pan-allele pHLA binding predictors. We find that the pHLA datasets overrepresent alleles from geographic populations of high-income countries. Second, we show that the identified data bias is perpetuated within ML models, leading to algorithmic bias and subpar performance for alleles expressed in low-income geographic populations. We draw attention to the potential therapeutic consequences of this bias, and we challenge the use of the term "pan-allele" to describe models trained with currently available public datasets.