Effective prediction of bacterial type IV secreted effectors by combined features of both C-termini and N-termini.
Author information
Wang Yu, Guo Yanzhi, Pu Xuemei, Li Menglong
Affiliations
College of Chemistry, Sichuan University, Chengdu, 610064, China.
College of Materials and Chemistry & Chemical Engineering, Chengdu University of Technology, Chengdu, 610059, China.
Publication information
J Comput Aided Mol Des. 2017 Nov;31(11):1029-1038. doi: 10.1007/s10822-017-0080-z. Epub 2017 Nov 10.
Various bacterial pathogens deliver secreted substrates, also called effectors, into host cells through type IV secretion systems (T4SSs) and thereby cause disease. Because T4SS secreted effectors (T4SEs) play important roles in pathogen-host interactions, identifying them is crucial to understanding the pathogenic mechanisms of T4SSs. A few computational methods that apply machine learning algorithms to T4SE prediction have been developed using features of C-terminal residues. However, recent studies have shown that targeting information can also be encoded in the N-terminal region of at least some T4SEs. In this study, we present an effective method for T4SE prediction that integrates both N-terminal and C-terminal sequence information. First, we collected from the literature a comprehensive dataset of known T4SEs and non-T4SEs across multiple bacterial species. Then, three types of features, namely amino acid composition (AAC); composition, transition and distribution (CTD); and position-specific scoring matrices (PSSMs), were calculated for the 50 N-terminal and 100 C-terminal residues. Next, we used information gain to rank the importance of the 150 residue positions for T4SE secretion signaling. Finally, 125 informative positions were selected for the prediction model that classifies T4SEs and non-T4SEs. The support vector machine model achieved an area under the receiver operating characteristic curve of 0.916 in fivefold cross-validation and an accuracy of 85.29% on the independent test set.
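The feature extraction and position ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, short sequences are padded with "X" as an assumption, and information gain is computed on the residue identity at each position, which is a simplification of the AAC, CTD, and PSSM features used in the study. The SVM classification step is omitted.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def terminal_regions(seq, n_len=50, c_len=100):
    """Return the first n_len and last c_len residues of a sequence.

    Shorter sequences are padded with 'X' (an assumption of this sketch).
    """
    n_term = seq[:n_len].ljust(n_len, "X")
    c_term = seq[-c_len:].rjust(c_len, "X")
    return n_term, c_term


def aac(region):
    """Amino acid composition: fraction of each standard residue in region."""
    counts = Counter(region)
    total = len(region) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """H(labels) minus the expected entropy of labels given the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_positions(sequences, labels, n_len=50, c_len=100):
    """Rank the n_len + c_len terminal positions by information gain.

    Returns (positions sorted from most to least informative, raw scores).
    """
    windows = ["".join(terminal_regions(s, n_len, c_len)) for s in sequences]
    width = n_len + c_len
    scores = [information_gain([w[i] for w in windows], labels)
              for i in range(width)]
    order = sorted(range(width), key=lambda i: scores[i], reverse=True)
    return order, scores
```

With the defaults, `rank_positions` scores 150 positions (50 N-terminal plus 100 C-terminal); keeping the first 125 entries of the returned order would mirror the selection step described in the abstract, under the simplifications noted above.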