Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa294.
AlgPred 2.0 is a web server for predicting allergenic proteins and allergenic regions within a protein. It is an updated version of AlgPred, which was developed in 2006. The dataset used for training, testing and validation consists of 10 075 allergens and 10 075 non-allergens. In addition, 10 451 experimentally validated immunoglobulin E (IgE) epitopes were used to identify antigenic regions in a protein. All models were trained on 80% of the data (the training dataset), and their performance was evaluated using 5-fold cross-validation. The final model, trained on the training dataset, was then evaluated on the remaining 20% of the data (the validation dataset); no two proteins drawn from any two sets share more than 40% similarity. Several approaches were used to predict allergens. First, a Basic Local Alignment Search Tool (BLAST) search was performed against the dataset, and allergens were predicted based on their level of similarity to known allergens. Second, IgE epitopes obtained from the IEDB database were searched for in the dataset, and allergens were predicted based on the presence of these epitopes in a protein. Third, motif-based approaches, namely Multiple EM for Motif Elicitation (MEME) and Motif Alignment and Search Tool (MAST), were used to predict allergens. Fourth, allergen prediction models were developed using a wide range of machine learning techniques. Finally, an ensemble approach was used to predict allergenic proteins by combining the prediction scores of the different approaches. Our best model achieved an area under the receiver operating characteristic curve of 0.98 and a Matthews correlation coefficient of 0.85 on the validation dataset. The AlgPred 2.0 web server supports allergen prediction, IgE epitope mapping, motif search and BLAST search (https://webs.iiitd.edu.in/raghava/algpred2/).
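The two metrics reported above, and the general idea of combining scores from several approaches, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy labels and scores, and in particular the additive `bonus` used in `ensemble_score` are assumptions made only for illustration, since the abstract does not say how the scores are combined.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = allergen)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auroc(y_true, scores):
    """Area under the ROC curve as the probability that a random positive
    outscores a random negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_score(ml_score, blast_hit, motif_hit, bonus=0.5):
    """Hypothetical combination rule (an assumption, not taken from the
    abstract): add a fixed bonus to the machine learning score for each
    similarity-based or motif-based hit."""
    return ml_score + bonus * int(blast_hit) + bonus * int(motif_hit)


# Toy data, invented for illustration only.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]
print(auroc(labels, scores), mcc(labels, predictions))
```

The pairwise formulation of the area under the curve is quadratic in the number of samples, which is fine for a small example; on a dataset of the size described above, a rank-based implementation from an established library would be the more practical choice.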