Shah Anuj R, Oehmen Christopher S, Harper Jill, Webb-Robertson Bobbie-Jo M
Computational Biology & Bioinformatics, Pacific Northwest National Laboratory, Richland, WA 99352, USA.
Comput Biol Chem. 2007 Apr;31(2):138-42. doi: 10.1016/j.compbiolchem.2007.02.012. Epub 2007 Feb 23.
A major challenge in homology detection is identifying sequences that share a common evolutionary ancestor despite substantial primary sequence divergence. Remote homologs often have less than 30% sequence identity, yet still retain common structural and functional properties. We demonstrate a novel method for identifying remote homologs using a support vector machine (SVM) classifier trained on fused sequence similarity scores and subcellular location predictions. SVMs have been shown to perform well in a variety of applications where binary classification of data is the goal. At the same time, data fusion methods have been shown to be highly effective in enhancing the discriminative power of data. Combining these two approaches in the application SVM-SimLoc identified significantly more remote homologs (p-value < 0.006) than using either sequence similarity or subcellular location alone.
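The fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the SVM-SimLoc implementation: the abstract does not specify the kernel, the fusion scheme, or the feature encoding, so everything below is an assumption made for illustration. The sketch assumes fusion by simple concatenation of a similarity-score vector with a vector of predicted subcellular location probabilities, and it trains a linear SVM by stochastic subgradient descent on the hinge loss. The function names (`fuse`, `train_linear_svm`, `predict`), the toy feature values, and the three hypothetical compartments are all invented for this example.

```python
from typing import List, Sequence, Tuple


def fuse(sim_scores: Sequence[float], loc_probs: Sequence[float]) -> List[float]:
    """Fuse two evidence sources by concatenation (an assumed fusion scheme)."""
    return list(sim_scores) + list(loc_probs)


def decision(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score of a linear classifier: w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lam: float = 0.001,
    lr: float = 0.1,
    epochs: int = 1000,
) -> Tuple[List[float], float]:
    """Linear SVM via subgradient descent on lam/2 * |w|^2 + hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * decision(w, b, xi) < 1.0
            # Regularization step: shrink the weights slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if violated:
                # Hinge-loss step: move the boundary toward the margin.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 (predicted homolog) or -1 (predicted non-homolog)."""
    return 1 if decision(w, b, x) >= 0.0 else -1


# Hypothetical toy data. Each protein has two normalized similarity scores
# and predicted probabilities for three made-up compartments
# (cytoplasm, membrane, nucleus). Labels: +1 homolog, -1 non-homolog.
similarity = [
    [0.9, 0.8], [0.4, 0.5], [0.8, 0.3],  # homologs (second one is "remote")
    [0.2, 0.1], [0.5, 0.4], [0.1, 0.2],  # non-homologs
]
location = [
    [0.1, 0.8, 0.1], [0.1, 0.9, 0.0], [0.2, 0.7, 0.1],
    [0.8, 0.1, 0.1], [0.7, 0.1, 0.2], [0.1, 0.2, 0.7],
]
y = [1, 1, 1, -1, -1, -1]

X = [fuse(s, l) for s, l in zip(similarity, location)]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

In this toy set, the second homolog and the second non-homolog have nearly identical similarity scores, so the location features are what separate them. That mirrors the motivation in the abstract, but it is a property of the invented data and says nothing about the reported results.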