PredPPCrys: accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection.

Authors

Wang Huilin, Wang Mingjun, Tan Hao, Li Yuan, Zhang Ziding, Song Jiangning

Affiliations

National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China.

Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia.

Publication information

PLoS One. 2014 Aug 22;9(8):e105902. doi: 10.1371/journal.pone.0105902. eCollection 2014.

DOI: 10.1371/journal.pone.0105902

PMID: 25148528

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4141844/

Abstract

X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed 'PredPPCrys' using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. 
In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys.
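
The two-level design described in the abstract (first-level SVM outputs feeding a second-level SVM) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a linear SVM trained by plain subgradient descent on the hinge loss, one first-level model per feature group, and a single binary label rather than the five experimental steps. The group names `group_a` and `group_b`, the toy vectors, and all function names are hypothetical. The abstract does not specify the kernel, the feature encodings, or how the first-level outputs are arranged.

```python
# Illustrative two-level (stacked) linear SVM; standard library only.
# Assumptions (not from the paper): linear kernel, hinge-loss subgradient
# descent, one first-level model per hypothetical feature group.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    X is a list of feature vectors, y a list of labels in {-1, +1}.
    Returns the weight vector w and the bias b.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision_value(model, x):
    """Signed score w.x + b of a trained linear SVM for one vector."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_two_level(feature_groups, y):
    """feature_groups maps a group name to a list of per-protein vectors."""
    names = sorted(feature_groups)
    # Level 1: one SVM per feature group.
    level1 = {name: train_linear_svm(feature_groups[name], y) for name in names}
    # Level 2 input: one first-level decision value per feature group.
    Z = [[decision_value(level1[name], feature_groups[name][i]) for name in names]
         for i in range(len(y))]
    level2 = train_linear_svm(Z, y)
    return names, level1, level2


def predict_two_level(model, sample):
    """sample maps each group name to that protein's feature vector."""
    names, level1, level2 = model
    z = [decision_value(level1[name], sample[name]) for name in names]
    return 1 if decision_value(level2, z) >= 0 else -1


# Toy data: four "proteins", two hypothetical feature groups.
toy_groups = {
    "group_a": [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]],
    "group_b": [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]],
}
toy_labels = [1, 1, -1, -1]
toy_model = train_two_level(toy_groups, toy_labels)
print(predict_two_level(toy_model, {"group_a": [2.5, 2.5], "group_b": [1.5, 0.0]}))
```

In this sketch the second level is trained on in-sample first-level scores for brevity. A real pipeline would more likely use cross-validated first-level outputs to limit overfitting.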
