Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India.
PLoS One. 2013;8(2):e46468. doi: 10.1371/journal.pone.0046468. Epub 2013 Feb 15.
Detecting remote homology among proteins using only unlabelled sequences is a central problem in comparative genomics. Cluster kernel methods based on neighborhoods and profiles, together with Markov clustering algorithms, are currently the most popular methods for protein family recognition. These methods either deviate from random walks through inflation or depend on a hard threshold in the similarity measure, and so need enhancement for homology detection among multi-domain proteins. We propose combining spectral clustering with neighborhood kernels in Markov similarity to enhance sensitivity in detecting homology independent of "recent" paralogs. The spectral clustering approach with new combined local alignment kernels exploits the unlabelled protein sequences globally and more effectively, reducing inter-cluster walks. When combined with corrections based on a modified symmetry-based proximity norm that de-emphasizes outliers, the proposed technique outperforms the other state-of-the-art cluster kernels among all twelve implemented kernels. Comparison with the state-of-the-art string and mismatch kernels also shows superior performance scores for the proposed kernels. A similar performance improvement is found on an existing large dataset. The proposed spectral clustering framework over combined local alignment kernels with modified symmetry-based correction therefore achieves superior performance, with better biological relevance, for unsupervised remote homolog detection, even in multi-domain and promiscuous-domain proteins from Genolevures database families. Source code is available upon request.
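The general idea of spectral clustering over a combined similarity matrix can be illustrated with a minimal sketch. This is not the authors' implementation: the averaging rule in `combine_kernels`, the helper `block_kernel`, and the toy similarity values are all assumptions made for illustration, the symmetry-based correction is omitted, and only a two-way split via the Fiedler vector of the normalized Laplacian is shown.

```python
import numpy as np


def combine_kernels(kernels, weights=None):
    """Weighted average of similarity matrices (hypothetical combination rule)."""
    kernels = [np.asarray(k, dtype=float) for k in kernels]
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    combined = sum(w * k for w, k in zip(weights, kernels))
    return (combined + combined.T) / 2.0  # enforce symmetry


def spectral_bipartition(similarity):
    """Split items into two clusters using the Fiedler vector of the
    symmetric normalized graph Laplacian."""
    s = np.asarray(similarity, dtype=float).copy()
    np.fill_diagonal(s, 0.0)              # ignore self-similarity
    degree = s.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)      # assumes every item has a neighbor
    laplacian = np.eye(len(s)) - inv_sqrt[:, None] * s * inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues ascending
    fiedler = eigenvectors[:, 1]          # second-smallest eigenvalue
    return (fiedler > 0).astype(int)


def block_kernel(within, between, n=3):
    """Toy similarity matrix for two families of n sequences each
    (made-up values, standing in for real alignment-based kernels)."""
    k = np.full((2 * n, 2 * n), between, dtype=float)
    k[:n, :n] = within
    k[n:, n:] = within
    np.fill_diagonal(k, 1.0)
    return k


# Two hypothetical kernels over the same six sequences.
combined = combine_kernels([block_kernel(0.9, 0.1), block_kernel(0.8, 0.0)])
labels = spectral_bipartition(combined)
print(labels)  # first three and last three sequences get different labels
```

Which family receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. For more than two clusters, the usual extension is to run k-means on the leading eigenvectors.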