La Jolla Institute for Allergy and Immunology, La Jolla, USA.
BMC Bioinformatics. 2010 Nov 22;11:568. doi: 10.1186/1471-2105-11-568.
MHC class II binding predictions are widely used to identify epitope candidates in infectious agents, allergens, cancer and autoantigens. The vast majority of prediction algorithms for human MHC class II to date have targeted HLA molecules encoded in the DR locus. This reflects a significant gap in knowledge, as HLA DP and DQ molecules are presumably equally important and have been studied less only because they are more difficult to handle experimentally.
In this study, we aimed to narrow this gap by providing a large-scale dataset of over 17,000 HLA-peptide binding affinities for a set of 11 HLA DP and DQ alleles. We also expanded our dataset for HLA DR alleles, resulting in a total of 40,000 MHC class II binding affinities covering 26 allelic variants. Using this dataset, we generated prediction tools based on several machine learning algorithms and evaluated their performance.
We found that 1) prediction methodologies developed for HLA DR molecules perform equally well for DP or DQ molecules. 2) Prediction performance increased significantly compared to previous reports because of the larger amounts of training data available. 3) Homologous peptides shared between training and testing datasets should be avoided to obtain real-world estimates of prediction performance metrics. However, the relative ranking of different predictors is largely unaffected by the presence of homologous peptides, and predictors intended for end-user applications should include all training data for maximum performance. 4) The recently developed NN-align prediction method significantly outperformed all other algorithms, including a naïve consensus based on all prediction methods. A new consensus method that drops the comparatively weak ARB prediction method could outperform the NN-align method, but further research into how best to combine MHC class II binding predictions is required.
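To make the idea in point 4 concrete, the following is a minimal sketch of one possible way to build a consensus from several predictors and to drop a weak one. It assumes, purely for illustration, that each method outputs a predicted IC50 per peptide (lower means stronger binding) and that the consensus is the median of the per-method ranks. The abstract does not specify how the consensus was computed, so the combination rule, the function names `rank_scores` and `consensus_rank`, the method names, and the peptide labels are all hypothetical.

```python
from statistics import median


def rank_scores(scores):
    """Rank peptides for one method: rank 1 = lowest predicted IC50.

    Assumption: lower score means stronger predicted binding.
    """
    ordered = sorted(scores, key=scores.get)
    return {peptide: i + 1 for i, peptide in enumerate(ordered)}


def consensus_rank(predictions, exclude=()):
    """Median of per-method ranks, optionally leaving some methods out.

    predictions: {method_name: {peptide: predicted_ic50}}
    exclude: method names to drop before combining (e.g. a weak predictor).
    This is an illustrative combination rule, not the paper's procedure.
    """
    used = {m: s for m, s in predictions.items() if m not in exclude}
    if not used:
        raise ValueError("no prediction methods left after exclusion")
    ranks = {m: rank_scores(s) for m, s in used.items()}
    # Only peptides scored by every remaining method are combined.
    shared = set.intersection(*(set(r) for r in ranks.values()))
    return {p: median(r[p] for r in ranks.values()) for p in shared}


# Hypothetical predictions (IC50 in nM) from three made-up methods.
predictions = {
    "method_a": {"PEP1": 50, "PEP2": 500, "PEP3": 5000},
    "method_b": {"PEP1": 80, "PEP2": 300, "PEP3": 9000},
    "weak_method": {"PEP1": 7000, "PEP2": 100, "PEP3": 20},
}

all_methods = consensus_rank(predictions)
without_weak = consensus_rank(predictions, exclude=("weak_method",))
```

A median is used here because it limits the influence of a single outlying method. Whether excluding a method actually helps would have to be checked on held-out binding data, which is the open question the abstract points to.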