Martelli Pier Luigi, Fariselli Piero, Casadio Rita
Laboratory of Biocomputing, CIRB/Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, Italy.
Bioinformatics. 2003;19 Suppl 1:i205-11. doi: 10.1093/bioinformatics/btg1027.
All-alpha membrane proteins constitute a functionally relevant subset of the whole proteome. Based on sequence comparison and specific predictive methods, they account for about 10 to 30% of cell proteins. Because few membrane proteins have been solved at atomic resolution, the training/testing sets of predictive methods for protein topography and topology routinely include very few well-solved structures mixed with about a hundred proteins known only at low resolution. Moreover, available predictors fail to predict recently crystallised membrane proteins (Chen et al., 2002). At present, the set of well-solved membrane proteins comprises some 59 chains of low sequence homology. It is therefore possible to train and test predictors using only proteins known at atomic resolution, and to evaluate the performance of different methods more thoroughly.
We implement a cascade neural network (NN), two different hidden Markov models (HMM), and, as a new method, their ensemble (ENSEMBLE). We train and test the three methods and ENSEMBLE in cross-validation on the 59 well-resolved membrane proteins. ENSEMBLE achieves a per-protein accuracy of 90% for topography and 71% for topology, outperforming the best single method by 7 and 5 percentage points, respectively. When tested on a low-resolution set of 151 proteins with no homology to the 59 proteins, the per-protein accuracy of ENSEMBLE is 76% for topography and 68% for topology. Our results also indicate that ENSEMBLE performs better than the best predictors presently available on the Web.
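The abstract does not state how the three predictors are combined or how a protein is scored as correct. As a minimal sketch only, the Python below assumes a per-residue majority vote over three label strings ('T' for transmembrane, 'L' for loop) and a hypothetical topography criterion: the predicted and observed segment counts match and each pair of segments overlaps. The function names, the label alphabet, and the scoring rule are all illustrative assumptions, not the authors' procedure. Topology (segment orientation) is not modelled here.

```python
from collections import Counter
from itertools import groupby


def majority_vote(predictions):
    """Combine equal-length per-residue label strings by majority vote.

    Assumption: simple voting; the abstract does not give the real rule.
    If every label in a column is different, the first predictor's label wins.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*predictions)
    )


def segments(labels, target="T"):
    """Return half-open (start, end) index pairs for each run of `target`."""
    found, position = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == target:
            found.append((position, position + length))
        position += length
    return found


def topography_correct(predicted, observed):
    """Hypothetical criterion: same segment count, and paired segments overlap."""
    pred, obs = segments(predicted), segments(observed)
    if len(pred) != len(obs):
        return False
    return all(ps < oe and os_ < pe for (ps, pe), (os_, oe) in zip(pred, obs))


def per_protein_accuracy(pairs):
    """Fraction of (predicted, observed) pairs judged correct as whole proteins."""
    return sum(topography_correct(p, o) for p, o in pairs) / len(pairs)


# Toy inputs, invented for illustration only.
nn, hmm1, hmm2 = "LLTTTTLL", "LLTTTLLL", "LTTTTTLL"
consensus = majority_vote([nn, hmm1, hmm2])
```

Per-protein accuracy counts a protein as correct only when the whole prediction passes the criterion, which makes it a stricter measure than per-residue accuracy.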