Weill Cornell/Rockefeller/Sloan Kettering Tri-Institutional MD-PhD Program, 1300 York Avenue, New York, NY, USA.
Caryl and Israel Englander Institute for Precision Medicine, Weill Cornell Medical College, 413 East 69th Street, New York, NY, USA.
BMC Bioinformatics. 2019 Jan 5;20(1):7. doi: 10.1186/s12859-018-2561-z.
To further our understanding of immunopeptidomics, improved tools are needed to identify peptides presented by major histocompatibility complex class I (MHC-I). Many existing tools are limited by their reliance on chemical affinity data, which are less biologically relevant than data obtained by sampling with mass spectrometry, and other tools are limited by incomplete exploration of machine learning approaches. Herein, we assemble publicly available data describing human peptides discovered by sampling the MHC-I immunopeptidome with mass spectrometry and use this database to train random forest classifiers (ForestMHC) to predict presentation by MHC-I.
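As a rough illustration of how peptide sequences might be turned into numeric input for a random forest, the sketch below one-hot encodes fixed-length peptides. The abstract does not state which encoding ForestMHC uses, so the function name `one_hot_encode`, the ordering of the 20-letter alphabet, and the default peptide length of 9 are all assumptions made for illustration.

```python
# Hypothetical feature encoding; not taken from the ForestMHC source.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide, length=9):
    """Return a flat 0/1 vector with one block of 20 entries per position."""
    peptide = peptide.upper()
    if len(peptide) != length:
        raise ValueError(f"expected a peptide of length {length}, got {len(peptide)}")
    vector = [0] * (length * len(AMINO_ACIDS))
    for position, residue in enumerate(peptide):
        if residue not in AA_INDEX:
            raise ValueError(f"unknown residue {residue!r} at position {position}")
        # Set the single entry for this residue inside this position's block.
        vector[position * len(AMINO_ACIDS) + AA_INDEX[residue]] = 1
    return vector


features = one_hot_encode("ACDEFGHIK")
print(len(features), sum(features))
```

Vectors of this kind could be passed to any off-the-shelf classifier, for example scikit-learn's `RandomForestClassifier`, with mass spectrometry hits as positives and a decoy set as negatives. How the negatives are chosen is a separate design decision that this sketch does not address.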
As measured by precision in the top 1% of predictions, our method outperforms NetMHC and NetMHCpan on test sets, and it outperforms both these methods and MixMHCpred on new data from an ovarian carcinoma cell line. We also find that random forest scores correlate monotonically, but not linearly, with known chemical binding affinities, and an information-based analysis of classifier features shows the importance of anchor positions for our classification. The random-forest approach also outperforms a deep neural network and a convolutional neural network trained on identical data. Finally, we use our large database to confirm that gene expression partially determines peptide presentation.
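The two evaluation ideas in the paragraph above can be sketched in a few lines. The first function computes precision among the top-scoring fraction of predictions. The second contrasts a rank-based (Spearman) correlation with a linear (Pearson) one, which is one common way to check for a monotonic but non-linear relationship. The abstract does not say which statistic the authors used, so the choice of Spearman here, along with all function names and the toy data, is an assumption.

```python
import math


def precision_at_top_fraction(scores, labels, fraction=0.01):
    """Fraction of true positives among the highest-scoring predictions."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")
    k = max(1, round(len(scores) * fraction))  # always keep at least one prediction
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, label in ranked[:k] if label) / k


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: 200 predictions, so the top 1% is the 2 highest scores.
toy_scores = [i / 200 for i in range(200)]
toy_labels = [1 if i in (150, 199) else 0 for i in range(200)]
print(precision_at_top_fraction(toy_scores, toy_labels))

# A monotonic but non-linear relationship: y = exp(x).
xs = [1, 2, 3, 4, 5]
ys = [math.exp(v) for v in xs]
print(spearman(xs, ys), pearson(xs, ys))
```

In the toy data, only one of the two highest-scoring predictions is a true positive. The exponential example gives a rank correlation of essentially 1 while the linear correlation is noticeably lower, the pattern one would expect for scores that track affinity monotonically but not linearly.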
ForestMHC is a promising method to identify peptides bound by MHC-I. We have demonstrated the utility of random forest-based approaches in predicting peptide presentation by MHC-I, assembled the largest known database of MS binding data, and mined this database to show the effect of gene expression on peptide presentation. ForestMHC has potential applicability to basic immunology, rational vaccine design, and neoantigen binding prediction for cancer immunotherapy. This method is publicly available for applications and further validation.