Sandhan Tushar, Yoo Youngjun, Choi Jin, Kim Sun
BMC Med Genomics. 2015;8 Suppl 2(Suppl 2):S12. doi: 10.1186/1755-8794-8-S2-S12. Epub 2015 May 29.
Uncovering the hidden organizational characteristics and regularities among biological sequences is key to a detailed understanding of the underlying biological phenomena. Pattern recognition in nucleic acid sequences is therefore an important task for protein function prediction. Because proteins from the same family exhibit similar characteristics, homology-based approaches predict protein function via protein classification. Conventional classification approaches, however, rely mostly on global features and consider only strong protein similarity matches, which leads to a significant loss of prediction accuracy.
Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers local as well as global features by examining the interactions among 'weakly interacting proteins' in the PPS network and by performing hierarchical graph analysis via a graph pyramid. Applying the proposed graph-based features at various pyramid levels uncovers different underlying properties of the protein families.
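To illustrate why weak matches can matter, the following is a minimal sketch and not the authors' algorithm: it omits the graph pyramid entirely and only contrasts a best-hit rule (strongest match only) with a similarity-weighted vote over all neighbours in a toy similarity network. The function names, the `min_score` parameter, the protein identifiers and the scores are all hypothetical.

```python
from collections import defaultdict


def best_hit_label(scores, labels):
    """Baseline: return the family of the single strongest match."""
    best = max(scores, key=scores.get)
    return labels[best]


def weighted_vote_label(scores, labels, min_score=0.0):
    """Sum similarity per family over every neighbour scoring above min_score.

    With min_score=0.0, weak matches also contribute to the vote.
    """
    totals = defaultdict(float)
    for protein, score in scores.items():
        if score > min_score:
            totals[labels[protein]] += score
    return max(totals, key=totals.get)


# Hypothetical training proteins with known families.
labels = {"p1": "A", "p2": "B", "p3": "B", "p4": "B"}
# Hypothetical similarity of one query sequence to each training protein.
scores = {"p1": 0.60, "p2": 0.45, "p3": 0.40, "p4": 0.35}

print(best_hit_label(scores, labels))            # strongest match only
print(weighted_vote_label(scores, labels))       # all neighbours vote
print(weighted_vote_label(scores, labels, 0.5))  # weak matches discarded
```

In this toy data the single strongest match belongs to family A, while the three weaker matches together give family B a larger total, so the two rules disagree. Which answer is correct depends on the data; the sketch only shows how weak edges can change the decision.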
Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using the graph pyramid improves computational efficiency as well as protein classification accuracy. Quantitatively, among 14,086 test sequences, the proposed method misclassified on average only 21.1 sequences, whereas the baseline global feature matching method based on BLAST scores misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. As a result, it achieved more than 96% protein classification accuracy using only 20% of the training data per class.
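The misclassification counts can be converted into error rates with simple arithmetic. The short sketch below takes the reported averages at face value; the variable and function names are illustrative, and the abstract does not state the training setup under which these counts were obtained, so they need not correspond to the 20% per-class setting.

```python
N_TEST = 14086           # test sequences reported in the abstract
ERRORS_PROPOSED = 21.1   # average misclassifications, proposed method
ERRORS_BASELINE = 362.9  # average misclassifications, BLAST-score baseline


def error_rate(errors, total):
    """Fraction of test sequences that were misclassified."""
    return errors / total


rate_proposed = error_rate(ERRORS_PROPOSED, N_TEST)
rate_baseline = error_rate(ERRORS_BASELINE, N_TEST)
reduction = ERRORS_BASELINE / ERRORS_PROPOSED

print(f"proposed: {rate_proposed:.3%} error, {1 - rate_proposed:.2%} accuracy")
print(f"baseline: {rate_baseline:.3%} error, {1 - rate_baseline:.2%} accuracy")
print(f"baseline makes about {reduction:.1f}x as many errors")
```

On these figures the proposed method has an error rate of roughly 0.15% against roughly 2.6% for the baseline, which is about 17 times fewer misclassified sequences.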