Plant Biology Division, Samuel Roberts Noble Foundation, Ardmore, Oklahoma 73401, USA.
Plant Physiol. 2010 Sep;154(1):36-54. doi: 10.1104/pp.110.156851. Epub 2010 Jul 20.
A complete map of the Arabidopsis (Arabidopsis thaliana) proteome is a major goal for the plant research community, because it would help determine the function and regulation of each encoded protein. Genome-wide prediction tools, such as those that localize gene products at the subcellular level, will substantially advance Arabidopsis gene annotation. To this end, we performed a comprehensive study in Arabidopsis and created an integrative support vector machine-based localization predictor called AtSubP (for Arabidopsis subcellular localization predictor). AtSubP combines diverse protein features, such as amino acid composition, sequence-order effects, terminal information, Position-Specific Scoring Matrix, and similarity search-based Position-Specific Iterated-Basic Local Alignment Search Tool information. When used to predict seven subcellular compartments in a 5-fold cross-validation test, our best hybrid classifier achieved an overall sensitivity of 91%, with high-confidence precision and Matthews correlation coefficient values of 90.9% and 0.89, respectively. Benchmarking AtSubP on two independent data sets, one from Swiss-Prot and another containing green fluorescent protein- and mass spectrometry-determined proteins, showed a significant improvement in the prediction accuracy of the species-specific AtSubP over some widely used "general" tools such as TargetP, LOCtree, PA-SUB, MultiLoc, WoLF PSORT, Plant-PLoc, and our newly created All-Plant method. Cross-comparison of AtSubP on six nontrained eukaryotic organisms (rice [Oryza sativa], soybean [Glycine max], human [Homo sapiens], yeast [Saccharomyces cerevisiae], fruit fly [Drosophila melanogaster], and worm [Caenorhabditis elegans]) revealed inferior predictions for these organisms. AtSubP significantly outperformed all the prediction tools currently used for Arabidopsis proteome annotation and may therefore serve as a better complement for the plant research community.
A supplemental Web site that hosts all the training/testing data sets and whole proteome predictions is available at http://bioinfo3.noble.org/AtSubP/.
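As an illustration of two ingredients named in the abstract, the following minimal sketch computes an amino acid composition feature vector and the per-compartment sensitivity, precision, and Matthews correlation coefficient from predicted labels. It is not the AtSubP implementation: the function names, the one-vs-rest treatment of each compartment, the handling of nonstandard residues, and the toy labels are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids; other symbols (e.g. X, B, Z) are ignored here (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue, in AMINO_ACIDS order."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def one_vs_rest_metrics(true_labels, predicted_labels, positive):
    """Sensitivity, precision, and MCC for one compartment against all others."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case (assumption).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, precision, mcc


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the AtSubP data sets.
    features = amino_acid_composition("MAACD")
    truth = ["chloroplast", "chloroplast", "nucleus", "nucleus"]
    preds = ["chloroplast", "nucleus", "nucleus", "nucleus"]
    print(features[:4])
    print(one_vs_rest_metrics(truth, preds, "chloroplast"))
```

In a pipeline of the kind the abstract describes, a composition vector like this would be only one of several feature groups passed to the classifier, and the metrics would be computed on the held-out fold in each round of cross-validation.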