Complex and Adaptive Systems Laboratory, Conway Institute of Biomolecular and Biomedical Science, School of Medicine and Medical Science, University College Dublin, Ireland.
Amino Acids. 2013 Aug;45(2):291-9. doi: 10.1007/s00726-013-1491-3. Epub 2013 Apr 9.
Knowledge of the subcellular location of a protein provides valuable information about its function, its possible interactions with other proteins and its drug targetability, among other things. The experimental determination of a protein's location in the cell is expensive, time-consuming and open to human error. Fast and accurate predictors of subcellular location have an important role to play if the abundance of sequence data that is now available is to be fully exploited. In the post-genomic era, genomes of many diverse organisms are available. Many of these organisms are important in human and veterinary disease and fall outside the well-studied plant, animal and fungal groups. We have developed a general eukaryotic subcellular localisation predictor (SCL-Epred) which assigns eukaryotic proteins to three classes that are important, in particular, for determining the drug targetability of a protein: secreted proteins, membrane proteins and proteins that are neither secreted nor membrane. The algorithm powering SCL-Epred is an N-to-1 neural network trained on very large non-redundant sets of protein sequences. SCL-Epred performs well on training data, achieving a Q of 86% and a generalised correlation of 0.75 when tested in tenfold cross-validation on a set of 15,202 redundancy-reduced protein sequences. The three-class accuracy of SCL-Epred and LocTree2, and in particular of a consensus predictor comprising both methods, surpasses that of other widely used predictors when benchmarked on a large redundancy-reduced independent test set of 562 proteins. SCL-Epred is publicly available at http://distillf.ucd.ie/distill/ .
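The abstract reports two three-class measures, Q and a generalised correlation, without defining them. The following is a minimal sketch of how such measures could be computed from a confusion matrix. It assumes that Q is the fraction of correctly classified proteins and that the generalised correlation is the chi-square-based coefficient often used for multi-class predictors; neither definition is stated in the abstract. The function names and the toy counts are hypothetical and are not results from the paper.

```python
from math import sqrt

def q_accuracy(confusion):
    """Fraction of correct predictions (assumed meaning of Q).

    confusion[i][j] = number of proteins of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def generalised_correlation(confusion):
    """Chi-square-based generalised correlation (assumed definition).

    GC = sqrt(chi2 / (N * (K - 1))), where chi2 compares each observed cell
    with the count expected if predictions were independent of true classes.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    chi2 = 0.0
    for i in range(k):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows/columns to avoid dividing by zero
                chi2 += (confusion[i][j] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))

# Toy counts for illustration only (rows: true class, columns: predicted class).
# Class order: secreted, membrane, neither.
toy_confusion = [
    [40, 5, 5],
    [4, 42, 4],
    [6, 6, 88],
]

if __name__ == "__main__":
    print("Q  =", round(q_accuracy(toy_confusion), 3))
    print("GC =", round(generalised_correlation(toy_confusion), 3))
```

Under this definition the generalised correlation is 1 for a perfect predictor and 0 when predictions are statistically independent of the true classes, which makes it less sensitive to class imbalance than Q alone.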