Vasylenko Tamara, Liou Yi-Fan, Chen Hong-An, Charoenkwan Phasit, Huang Hui-Ling, Ho Shinn-Ying
BMC Bioinformatics. 2015;16 Suppl 1(Suppl 1):S8. doi: 10.1186/1471-2105-16-S1-S8. Epub 2015 Jan 21.
Photosynthetic proteins (PSPs) differ greatly in structure and function because they are involved in numerous subprocesses that take place inside an organelle called the chloroplast. Few studies predict PSPs from sequence, owing to the high diversity of their sequences and structures. This work aims to predict and characterize PSPs by establishing datasets of PSP and non-PSP sequences and developing prediction methods.
A novel bioinformatics method for predicting and characterizing PSPs, based on the scoring card method (SCM) and named SCMPSP, was used. First, a dataset was established consisting of 649 PSPs, selected using the Gene Ontology term GO:0015979, and 649 non-PSPs from the SwissProt database, with sequence identity <= 25%. Several prediction methods are presented, based on support vector machine (SVM), decision tree J48, Bayes, BLAST, and SCM. The SVM method using dipeptide features performed well and yielded a test accuracy of 72.31%. The SCMPSP method uses the estimated propensity scores of the 400 dipeptides to be PSPs and has a test accuracy of 71.54%, which is comparable to that of the SVM method. The derived propensity scores of the 20 amino acids were further used to identify informative physicochemical properties for characterizing PSPs. The analytical results reveal the following four characteristics of PSPs: 1) PSPs favour amino acids with hydrophobic side chains; 2) PSPs are composed of amino acids prone to form helices in membrane environments; 3) PSPs have low interaction with water; and 4) PSPs prefer amino acids with electron-reactive side chains.
The SCMPSP method not only estimates the propensity of a sequence to be a PSP but also reveals characteristics that improve the understanding of PSPs. The SCMPSP source code and the datasets used in this study are available at http://iclab.life.nctu.edu.tw/SCMPSP/.