Chang Jia-Ming, Su Emily Chia-Yu, Lo Allan, Chiu Hua-Sheng, Sung Ting-Yi, Hsu Wen-Lian
Bioinformatics Lab, Institute of Information Science, Academia Sinica, Taipei, Taiwan.
Proteins. 2008 Aug;72(2):693-710. doi: 10.1002/prot.21944.
Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. In recent years, many computational approaches have been proposed for predicting PSL from protein sequences in Gram-negative bacteria. We present PSLDoc, a method for this problem based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA). A protein is treated as a term string composed of gapped-dipeptides, which are defined as any two residues separated by one or more positions. The weight of each gapped-dipeptide is calculated from a position-specific scoring matrix, which carries sequence evolutionary information. PLSA is then applied for feature reduction, and the reduced vectors are input to five one-versus-rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. A strong correlation between sequence homology and subcellular localization has been reported (Nair and Rost, Protein Sci 2002;11:2836-2847; Yu et al., Proteins 2006;64:643-651). To evaluate the performance of PSLDoc properly, target proteins are classified into low- or high-homology data sets. PSLDoc's overall accuracy on the low- and high-homology data sets reaches 86.84% and 98.21%, respectively, which compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643-651). In addition, we set a confidence threshold to achieve high precision at specified levels of recall. When the confidence threshold is set at 0.7, PSLDoc achieves a precision of 97.89%, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617-623). Our approach demonstrates that this specific feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy. Moreover, because the representation is general, our method can be extended to eukaryotic proteomes in the future. The web server of PSLDoc is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/.
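The gapped-dipeptide representation and the confidence threshold described above can be illustrated with a minimal Python sketch. It is not the PSLDoc implementation. It rests on several assumptions: the names `gapped_dipeptide_counts`, `predict_with_threshold`, and `max_gap` are hypothetical; a gap is taken to mean the number of positions lying between the two residues; raw occurrence counts stand in for the weights derived from the position-specific scoring matrix; and the confidence is assumed to be the highest per-site probability. The PLSA reduction and the support vector machine classifiers are omitted, and the site names in the usage lines are illustrative only.

```python
from collections import Counter
from typing import Dict, Optional


def gapped_dipeptide_counts(sequence: str, max_gap: int) -> Counter:
    """Count gapped-dipeptide terms in a protein sequence.

    A term "A{g}B" means residue A followed by residue B with g
    positions between them (assumption: g runs from 1 to max_gap).
    Raw counts are used here; the source weights terms with a
    position-specific scoring matrix instead.
    """
    counts: Counter = Counter()
    n = len(sequence)
    for gap in range(1, max_gap + 1):
        # The second residue sits gap + 1 positions after the first.
        for i in range(n - gap - 1):
            j = i + gap + 1
            counts[f"{sequence[i]}{gap}{sequence[j]}"] += 1
    return counts


def predict_with_threshold(
    probabilities: Dict[str, float], threshold: float = 0.7
) -> Optional[str]:
    """Return the most probable site, or None if it is below the threshold.

    Assumption: the confidence score is the highest per-site probability
    produced by the one-versus-rest classifiers.
    """
    site, prob = max(probabilities.items(), key=lambda kv: kv[1])
    return site if prob >= threshold else None


if __name__ == "__main__":
    # Toy sequence, not real data.
    print(gapped_dipeptide_counts("ACDA", max_gap=2))
    # Illustrative probabilities, not classifier output.
    print(predict_with_threshold({"cytoplasmic": 0.9, "periplasmic": 0.1}))
    print(predict_with_threshold({"cytoplasmic": 0.6, "periplasmic": 0.4}))
```

Withholding a prediction when the top probability falls below the threshold trades recall for precision, which is the kind of trade-off the abstract reports at a threshold of 0.7.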