Hu Geng-Ming, Mai Te-Lun, Chen Chi-Ming
Department of Physics, National Taiwan Normal University, Taipei, Taiwan.
Genomics Research Center, Academia Sinica, Taipei, Taiwan.
Sci Rep. 2017 Nov 14;7(1):15495. doi: 10.1038/s41598-017-15707-9.
In this study, we delineate an unsupervised clustering algorithm, minimum span clustering (MSC), and apply it to detect G-protein coupled receptor (GPCR) sequences and to study the GPCR network using a base dataset of 2770 GPCR and 652 non-GPCR sequences. High detection accuracy can be achieved with a proper dataset. The clustering results of GPCRs derived from MSC show a strong correlation between their sequences and functions. By comparing our level 1 MSC results with the GPCRdb classification, the consistency is 87.9% for the fourth level of GPCRdb, 89.2% for the third level, 98.4% for the second level, and 100% for the top level (the lowest resolution level of GPCRdb). The MSC results of GPCRs can be well explained by estimating the selective pressure of GPCRs, as exemplified by investigating the largest two subfamilies, peptide receptors (PRs) and olfactory receptors (ORs), in class A GPCRs. PRs are decomposed into three groups due to a positive selective pressure, whilst ORs remain as a single group due to a negative selective pressure. Finally, we construct and compare phylogenetic trees using distance-based and character-based methods, a combination of which could convey more comprehensive information about the evolution of GPCRs.