Improving the chances of successful protein structure determination with a random forest classifier.

Author information

Jahandideh Samad, Jaroszewski Lukasz, Godzik Adam

Affiliations

Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA.

Publication information

Acta Crystallogr D Biol Crystallogr. 2014 Mar;70(Pt 3):627-35. doi: 10.1107/S1399004713032070. Epub 2014 Feb 15.

DOI:10.1107/S1399004713032070

PMID:24598732

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3949519/

Abstract

Obtaining diffraction-quality crystals remains one of the major bottlenecks in structural biology. The ability to predict the chances of crystallization from the amino-acid sequence of the protein can, at least partly, address this problem by allowing a crystallographer to select homologs that are more likely to succeed and/or to modify the sequence of the target to avoid features that are detrimental to successful crystallization. In 2007, the now widely used XtalPred algorithm [Slabinski et al. (2007), Protein Sci. 16, 2472-2482] was developed. XtalPred classifies proteins into five 'crystallization classes' based on a simple statistical analysis of the physicochemical features of a protein. Here, towards the same goal, advanced machine-learning methods are applied and, in addition, the predictive potential of additional protein features such as predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues and amino-acid composition of the predicted protein surface is tested. The new XtalPred-RF (random forest) achieves a significant improvement in the prediction of crystallization success over the original XtalPred. To illustrate this, XtalPred-RF was tested by revisiting target selection from 271 Pfam families targeted by the Joint Center for Structural Genomics (JCSG) in PSI-2, and it was estimated that the number of targets entered into the protein-production and crystallization pipeline could have been reduced by 30% without lowering the number of families for which the first structures were solved. The prediction improvement depends on the subset of targets used as a testing set and reaches 100% (i.e. twofold) for the top class of predicted targets.
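The general idea of scoring a sequence with a tree ensemble can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the XtalPred-RF implementation: the abstract does not give the actual feature set, training data, or parameters. The three features, the bagged decision stumps (a much-simplified stand-in for a full random forest), the toy sequences, and their 0/1 labels are all hypothetical placeholders. Only the Kyte-Doolittle hydropathy values are standard published numbers.

```python
import random

# Kyte-Doolittle hydropathy scale (standard published values).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def sequence_features(seq):
    """Return [length, mean hydropathy, fraction of charged residues].
    These three features are illustrative assumptions only."""
    seq = seq.upper()
    n = len(seq)
    mean_hydropathy = sum(KD[a] for a in seq) / n
    charged = sum(seq.count(a) for a in "DEKR") / n
    return [float(n), mean_hydropathy, charged]

def predict_stump(stump, row):
    """A stump votes 1 when (value > threshold) matches its polarity."""
    f, t, polarity = stump
    return int((row[f] > t) == bool(polarity))

def fit_stump(X, y, feature_ids):
    """Exhaustively pick the stump with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            for polarity in (0, 1):
                stump = (f, t, polarity)
                errors = sum(predict_stump(stump, r) != lab
                             for r, lab in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, stump)
    return best[1]

def fit_forest(X, y, n_trees=25, seed=0):
    """Bagging: each stump sees a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    k = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feats = rng.sample(range(n_features), k)      # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_proba(forest, row):
    """Fraction of stumps voting for class 1."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Made-up sequences with arbitrary 0/1 labels, used only to exercise the code.
TOY = [("MKKLEELLKKAEELLKR", 1), ("MSDEEKRKLEEALKKG", 1), ("MAEKKRDELLKEAG", 1),
       ("MLLVVIIAALLFFVVG", 0), ("MIVLLAFFIVLLGAVI", 0), ("MFFLLIIVVAALLWWG", 0)]

X = [sequence_features(s) for s, _ in TOY]
y = [label for _, label in TOY]
forest = fit_forest(X, y)
score = predict_proba(forest, sequence_features("MKELLKKAEEG"))
print(f"fraction of votes for class 1: {score:.2f}")
```

A score of this kind could then be binned into a small number of classes, in the spirit of the crystallization classes mentioned above. Any bin boundaries would have to come from the paper itself.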

Similar articles

SVMCRYS: an SVM approach for the prediction of protein crystallization propensity from protein sequence.

Protein Pept Lett. 2010 Apr;17(4):423-30. doi: 10.2174/092986610790963726.

Meta prediction of protein crystallization propensity.

Biochem Biophys Res Commun. 2009 Dec 4;390(1):10-5. doi: 10.1016/j.bbrc.2009.09.036. Epub 2009 Sep 13.

CRYSpred: accurate sequence-based protein crystallization propensity prediction using sequence-derived structural characteristics.

Protein Pept Lett. 2012 Jan;19(1):40-9. doi: 10.2174/092986612798472910.

CRYSTALP2: sequence-based protein crystallization propensity prediction.

BMC Struct Biol. 2009 Jul 31;9:50. doi: 10.1186/1472-6807-9-50.

XtalPred: a web server for prediction of protein crystallizability.

Bioinformatics. 2007 Dec 15;23(24):3403-5. doi: 10.1093/bioinformatics/btm477. Epub 2007 Oct 5.

PredPPCrys: accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection.

PLoS One. 2014 Aug 22;9(8):e105902. doi: 10.1371/journal.pone.0105902. eCollection 2014.

Sequence-based prediction of protein crystallization, purification and production propensity.

Bioinformatics. 2011 Jul 1;27(13):i24-33. doi: 10.1093/bioinformatics/btr229.

XANNpred: neural nets that predict the propensity of a protein to yield diffraction-quality crystals.

Proteins. 2011 Apr;79(4):1027-33. doi: 10.1002/prot.22914. Epub 2011 Jan 18.

Using Recursive Feature Selection with Random Forest to Improve Protein Structural Class Prediction for Low-Similarity Sequences.

Comput Math Methods Med. 2021 May 7;2021:5529389. doi: 10.1155/2021/5529389. eCollection 2021.

Cited by

High-Throughput Pipeline for Protein Expression and Solubility Profiling Using Synthetically Generated Plasmids.

Curr Protoc. 2025 Jul;5(7):e70188. doi: 10.1002/cpz1.70188.

Investigations on genomic, topological and structural properties of diguanylate cyclases involved in biofilm signalling using techniques: Promising drug targets in combating cholera.

Curr Res Struct Biol. 2025 Apr 9;9:100166. doi: 10.1016/j.crstbi.2025.100166. eCollection 2025 Jun.

Deep learning applications in protein crystallography.

Acta Crystallogr A Found Adv. 2024 Jan 1;80(Pt 1):1-17. doi: 10.1107/S2053273323009300.

Predictive Model of Functional Exercise Compliance of Patients with Breast Cancer Based on Decision Tree.

Int J Womens Health. 2023 Mar 21;15:397-410. doi: 10.2147/IJWH.S386405. eCollection 2023.

A Structural Systems Biology Approach to High-Risk CG23 Klebsiella pneumoniae.

Microbiol Resour Announc. 2023 Feb 16;12(2):e0101322. doi: 10.1128/mra.01013-22. Epub 2023 Jan 25.

Insights into the Structure of Rubisco from Dinoflagellates-In Silico Studies.

Int J Mol Sci. 2021 Aug 7;22(16):8524. doi: 10.3390/ijms22168524.

Sequence-Based Prediction of Transmembrane Protein Crystallization Propensity.

Interdiscip Sci. 2021 Dec;13(4):693-702. doi: 10.1007/s12539-021-00448-1. Epub 2021 Jun 18.

Regioselectivity of hyoscyamine 6β-hydroxylase-catalysed hydroxylation as revealed by high-resolution structural information and QM/MM calculations.

Dalton Trans. 2020 Apr 7;49(14):4454-4469. doi: 10.1039/d0dt00302f.

BCrystal: an interpretable sequence-based protein crystallization predictor.

Bioinformatics. 2020 Mar 1;36(5):1429-1438. doi: 10.1093/bioinformatics/btz762.

TMCrys: predict propensity of success for transmembrane protein crystallization.

Bioinformatics. 2018 Sep 15;34(18):3126-3130. doi: 10.1093/bioinformatics/bty342.
