Nielsen Morten, Lundegaard Claus, Worning Peder, Lauemøller Sanne Lise, Lamberth Kasper, Buus Søren, Brunak Søren, Lund Ole
Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, DK-2800 Lyngby, Denmark.
Protein Sci. 2003 May;12(5):1007-17. doi: 10.1110/ps.0239403.
We describe an improved neural network method for predicting T-cell class I epitopes. A novel input representation has been developed that combines sparse encoding, Blosum encoding, and input derived from hidden Markov models. We demonstrate that a combination of several neural networks, each derived using a different sequence-encoding scheme, outperforms neural networks derived using a single sequence-encoding scheme. The new method is shown to perform substantially better than other methods. Using mutual information calculations, we show that peptides that bind to the HLA A*0204 complex display a signal of higher-order sequence correlations. Neural networks are ideally suited to integrate such higher-order correlations when predicting binding affinity. It is this feature, combined with the use of several neural networks derived from different and novel sequence-encoding schemes and the ability of the neural network to be trained on data consisting of continuous binding affinities, that gives the new method its improved performance. The difference in predictive performance between the neural network methods and the matrix-driven methods is most significant for peptides that bind strongly to the HLA molecule, confirming that the signal of higher-order sequence correlation is most strongly present in high-binding peptides. Finally, we use the method to predict T-cell epitopes for the genome of hepatitis C virus and discuss possible applications of the prediction method in guiding rational vaccine design.
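To make two of the ideas in the abstract concrete, the following is a minimal Python sketch of sparse (one-hot) encoding of a peptide and of a plain plug-in mutual information estimate between two peptide positions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the amino acid ordering, and the four toy 9-mers are hypothetical, and the estimate applies none of the small-sample corrections a real analysis would likely need.

```python
import math
from collections import Counter

# The 20 standard amino acids; this alphabetical ordering is an arbitrary choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(peptide):
    """One-hot encode a peptide: 20 inputs per residue, exactly one set to 1.0."""
    vector = []
    for residue in peptide:
        index = AMINO_ACIDS.index(residue)  # raises ValueError for other letters
        vector.extend(1.0 if k == index else 0.0 for k in range(len(AMINO_ACIDS)))
    return vector


def mutual_information(peptides, i, j):
    """Plug-in estimate (in bits) of the mutual information between positions i and j."""
    n = len(peptides)
    pair_counts = Counter((p[i], p[j]) for p in peptides)
    counts_i = Counter(p[i] for p in peptides)
    counts_j = Counter(p[j] for p in peptides)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        ratio = count * n / (counts_i[a] * counts_j[b])
        mi += p_ab * math.log2(ratio)
    return mi


# Hypothetical toy peptides (not real binding data), built so that
# positions 2 and 3 always vary together while positions 0 and 2 do not.
toy_peptides = ["ALAKAAAAV", "GLAKAAAAV", "ALDEAAAAL", "GLDEAAAAL"]

encoded = sparse_encode(toy_peptides[0])
mi_coupled = mutual_information(toy_peptides, 2, 3)
mi_independent = mutual_information(toy_peptides, 0, 2)
```

In this toy set the coupled pair of positions carries one bit of mutual information and the independent pair carries none. A position-specific scoring matrix treats each position separately and so cannot represent the first kind of dependency, which is the type of higher-order signal the abstract argues neural networks can exploit.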