Chou Kuo-Chen, Shen Hong-Bin
Gordon Life Science Institute, San Diego, CA 92130, USA.
Biochem Biophys Res Commun. 2007 Aug 24;360(2):339-45. doi: 10.1016/j.bbrc.2007.06.027. Epub 2007 Jun 15.
Given an uncharacterized protein sequence, how can we determine whether it is a membrane protein? If it is, which membrane protein type does it belong to? These questions matter because the answers are closely related to the biological function of the query protein and to how it interacts with other molecules in a biological system. With the avalanche of protein sequences generated in the post-genomic age, and the much slower progress of biochemical experiments in determining their functions, an automated method to help address these questions is highly desirable. In this study, a two-layer predictor called MemType-2L was developed. The first-layer prediction engine identifies a query protein as membrane or non-membrane; if it is a membrane protein, the process continues automatically with the second-layer prediction engine, which identifies its type among the following eight categories: (1) type I, (2) type II, (3) type III, (4) type IV, (5) multipass, (6) lipid-chain-anchored, (7) GPI-anchored, and (8) peripheral. MemType-2L incorporates evolutionary information by representing protein samples as Pse-PSSM (Pseudo Position-Specific Score Matrix) vectors, and it uses an ensemble classifier formed by fusing many individual OET-KNN (Optimized Evidence-Theoretic K-Nearest Neighbor) classifiers. The success rates obtained by MemType-2L on a newly constructed stringent dataset, in both the jackknife test and the independent dataset test, are quite high, indicating that MemType-2L may become a very useful high-throughput tool. As a Web server, MemType-2L is freely accessible to the public at http://chou.med.harvard.edu/bioinf/MemType.
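The two-layer flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions: the function names (`pse_pssm`, `knn_predict`, `ensemble_predict`, `memtype_2l_sketch`) are hypothetical; the Pse-PSSM formula used here (20 column means plus 20 lagged squared-difference terms) is one commonly described formulation that the abstract itself does not state; a plain majority-vote KNN stands in for OET-KNN, which relies on evidence theory; and each PSSM is assumed to have been computed upstream as an L x 20 array.

```python
from collections import Counter

import numpy as np


def pse_pssm(pssm, lag):
    """Assumed Pse-PSSM: 20 column means + 20 lagged squared-difference means."""
    pssm = np.asarray(pssm, dtype=float)
    if lag < 1 or pssm.ndim != 2 or pssm.shape[0] <= lag:
        raise ValueError("need a 2-D PSSM with more rows than the lag (lag >= 1)")
    means = pssm.mean(axis=0)
    # Mean squared difference between rows that are `lag` positions apart.
    theta = ((pssm[:-lag] - pssm[lag:]) ** 2).mean(axis=0)
    return np.concatenate([means, theta])


def knn_predict(train_X, train_y, x, k):
    """Plain majority-vote KNN (a stand-in for OET-KNN, not OET-KNN itself)."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def ensemble_predict(train_pssms, train_labels, query_pssm, lags=(1, 2), ks=(1, 3)):
    """Fuse one KNN vote per (lag, k) pair by simple majority."""
    votes = Counter()
    for lag in lags:
        X = np.stack([pse_pssm(p, lag) for p in train_pssms])
        q = pse_pssm(query_pssm, lag)
        for k in ks:
            votes[knn_predict(X, train_labels, q, k)] += 1
    return votes.most_common(1)[0][0]


def memtype_2l_sketch(query_pssm, layer1_data, layer2_data):
    """Layer 1: membrane vs non-membrane. Layer 2: membrane type."""
    pssms1, labels1 = layer1_data
    if ensemble_predict(pssms1, labels1, query_pssm) != "membrane":
        return "non-membrane"
    pssms2, labels2 = layer2_data
    return ensemble_predict(pssms2, labels2, query_pssm)
```

Here `layer1_data` and `layer2_data` are each a pair of (list of PSSM arrays, list of labels); the first uses the labels "membrane" and "non-membrane", and the second uses the eight type names. The second layer is consulted only when the first layer votes "membrane", which mirrors the automatic hand-off described above.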