Chrysostomou Charalambos, Seker Huseyin
Annu Int Conf IEEE Eng Med Biol Soc. 2014;2014:808-11. doi: 10.1109/EMBC.2014.6943714.
Current bioinformatics tools classify allergenic protein sequences accurately when the sequences share high homology, but generally perform poorly on low-homology sequences. Although some homologous regions explain Immunoglobulin E (IgE) cross-reactivity within groups of allergens, no universal molecular structure has been associated with allergenicity. In addition, studies have shown that cross-reactivity is not directly linked to the homology between protein sequences. A homology-independent method is therefore needed to determine whether a protein is an allergen. The aim of this study is to differentiate sets of allergenic and non-allergenic proteins using a signal-processing-based bioinformatics approach. A new method is proposed for the characterisation and classification of allergenic protein sequences: a hydrophobicity amino acid index encodes each protein as a numerical sequence, the Discrete Fourier Transform extracts features from that sequence, and a classifier based on Support Vector Machines is built on those features. To demonstrate the applicability of the proposed method, 857 allergen and 1000 non-allergen proteins were collected from the UniProt online database. The proposed method yielded MCC: 0.752 ± 0.007, Specificity: 0.912 ± 0.005, Sensitivity: 0.835 ± 0.008 and Total Accuracy: 87.65% ± 0.004.
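The encoding, feature-extraction and evaluation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the hydrophobicity index, so the Kyte-Doolittle scale is assumed here, and the function names (`encode`, `dft_power_spectrum`, `binary_metrics`), the 64-point transform length and the mean-removal step are all hypothetical choices made for the example. The Support Vector Machine step is left out; the resulting feature vectors could be passed to any SVM implementation.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values. ASSUMPTION: the abstract only says
# "hydrophobicity amino acid index" and does not specify which scale was used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each residue to its hydrophobicity value (KeyError on unknown residues)."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]


def dft_power_spectrum(signal, n_points=64):
    """Return |X[k]|^2 for k = 1 .. n_points // 2 of the mean-removed signal.

    The signal is truncated or zero-padded to n_points so that every protein
    yields a feature vector of the same length (a hypothetical design choice).
    """
    x = list(signal)[:n_points]
    if not x:
        raise ValueError("signal must contain at least one value")
    mean = sum(x) / len(x)
    # Remove the mean so the spectrum is not dominated by the DC offset,
    # then zero-pad up to the transform length.
    x = [v - mean for v in x] + [0.0] * (n_points - len(x))
    spectrum = []
    for k in range(1, n_points // 2 + 1):
        coeff = sum(
            x[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    # Made-up peptide used purely for illustration; not taken from the study's data.
    features = dft_power_spectrum(encode("MKVLAAGIVGLLLAQ"))
    print(len(features), "spectral features")
```

A direct O(N²) transform is used only to keep the sketch dependency-free; in practice a fast Fourier transform routine would normally replace the inner loop.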