Sadowski M I, Parish J H, Westhead D R
Bioinformatics Unit, Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK.
Biosystems. 2005 Sep;81(3):247-54. doi: 10.1016/j.biosystems.2005.05.001.
Several stratagems are used in protein bioinformatics for the classification of proteins based on sequence, structure or function. We explore the concept of a minimal signature embedded in a sequence that defines the likely position of a protein in a classification. Specifically, we address the derivation of sparse profiles for the G-protein coupled receptor (GPCR) clan of integral membrane proteins. We present an evolutionary algorithm (EA) for the derivation of sparse profiles (signatures) without the need to supply a multiple alignment. We also apply an evolution strategy (ES) to the problem of pattern and profile refinement. Patterns were derived for the GPCR 'superfamily' and for GPCR families 1-3 individually, starting from populations of randomly generated signatures and using a database of integral membrane protein sequences and an objective function based on a modified receiver operating characteristic (ROC) statistic. The signature derived for the family 1 GPCR sequences performed very well in a stringent cross-validation test, detecting 76% of unseen GPCR sequences at 5% error. Application of the ES refinement method to a signature developed by a previously described method [Sadowski, M.I., Parish, J.H., 2003. Automated generation and refinement of protein signatures: case study with G-protein coupled receptors. Bioinformatics 19, 727-734] resulted in a 6% increase in coverage at 5% error, as measured in the validation test. We note that there might be a limit to this or any classification of proteins based on patterns or schemata.
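The abstract does not give the signature encoding, the mutation operators, or the exact form of the modified ROC statistic, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes a signature is a list of (offset, allowed residues) pairs, scores a sequence by the best number of matched elements over all window positions, and substitutes a plain "coverage at a fixed error rate" for the modified ROC objective. The refinement loop is a simple elitist (1+λ) evolution strategy. All names (`score`, `coverage_at_error`, `mutate`, `refine`) and parameter values are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed encoding: a signature is a list of (offset, allowed_residues) pairs,
# where offset is relative to the start of a sliding window on the sequence.


def score(signature, seq):
    """Return the largest number of signature elements matched at any window start."""
    span = max(off for off, _ in signature) + 1
    best = 0
    for start in range(max(1, len(seq) - span + 1)):
        hits = sum(
            1
            for off, allowed in signature
            if start + off < len(seq) and seq[start + off] in allowed
        )
        best = max(best, hits)
    return best


def coverage_at_error(signature, positives, negatives, max_error=0.05):
    """Fraction of positives scoring above a threshold that lets through
    at most max_error of the negatives (a stand-in for the ROC-based objective)."""
    neg_scores = sorted((score(signature, s) for s in negatives), reverse=True)
    allowed_fp = int(max_error * len(neg_scores))
    # A sequence is a hit only if it scores strictly above this threshold,
    # so no more than allowed_fp negatives can be hits.
    threshold = neg_scores[allowed_fp] if allowed_fp < len(neg_scores) else -1
    return sum(score(signature, s) > threshold for s in positives) / len(positives)


def mutate(signature, rng, max_offset=30):
    """Copy the signature and change one element: toggle a residue or shift its offset."""
    child = [(off, set(allowed)) for off, allowed in signature]
    i = rng.randrange(len(child))
    off, allowed = child[i]
    if rng.random() < 0.5:
        aa = rng.choice(AMINO_ACIDS)
        if aa in allowed and len(allowed) > 1:
            allowed.remove(aa)  # never leave an element with no allowed residue
        else:
            allowed.add(aa)
    else:
        off = min(max_offset, max(0, off + rng.choice((-1, 1))))
    child[i] = (off, allowed)
    return child


def refine(signature, positives, negatives, generations=200, offspring=5, seed=0):
    """Elitist (1+lambda) evolution strategy: keep the parent unless a child is at least as fit."""
    rng = random.Random(seed)
    best = signature
    best_fit = coverage_at_error(best, positives, negatives)
    for _ in range(generations):
        children = [mutate(best, rng) for _ in range(offspring)]
        fits = [coverage_at_error(c, positives, negatives) for c in children]
        top = max(range(offspring), key=fits.__getitem__)
        if fits[top] >= best_fit:
            best, best_fit = children[top], fits[top]
    return best, best_fit
```

Calling `refine(start_signature, positives, negatives)` with lists of positive (family member) and negative (other membrane protein) sequences returns the refined signature and its coverage. Because the parent is replaced only by a child that is at least as fit, coverage on the training data cannot decrease; performance on unseen sequences would still need a held-out validation set, as in the cross-validation test described above.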