Fast algorithm for population-based protein structural model analysis.

Affiliation

Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65201, USA.

Publication information

Proteomics. 2013 Jan;13(2):221-9. doi: 10.1002/pmic.201200334. Epub 2013 Jan 3.

DOI:10.1002/pmic.201200334

PMID:23184517

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3641909/

Abstract

De novo protein structure prediction often generates a large population of candidates (models) and then selects near-native models through clustering. Existing structural model clustering methods are time-consuming because they compute pairwise distances between models. In this paper, we present a novel method for fast model clustering without loss of clustering accuracy. Instead of the commonly used pairwise root mean square deviation and TM-score values, we propose two new distance measures, Dscore1 and Dscore2, based on the comparison of protein distance matrices, for describing the difference and the similarity among models, respectively. The analysis indicates that the correlation between Dscore1 and root mean square deviation and the correlation between Dscore2 and TM-score are both high. Compared with existing methods, whose calculation time is quadratic in the number of models, our Dscore1-based clustering achieves linear time complexity while obtaining almost the same accuracy for near-native model selection. By using Dscore2 to select representatives of clusters, we can further improve the quality of the representatives with little increase in computing time. In addition, for large (~500 k) model sets, we can provide fast data visualization based on the Dscore distribution in seconds to minutes. Our method has been implemented in a package named MUFOLD-CL, available at http://mufold.org/clustering.php.
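The abstract does not give the formulas for Dscore1 or Dscore2, so the following is only a minimal sketch of the general idea of comparing models through their distance matrices. It assumes each model is a list of 3D residue coordinates (for example C-alpha positions) and uses a plain root mean square difference between two distance matrices as a stand-in measure. The function names `distance_matrix` and `matrix_rms_difference` and the toy coordinates are hypothetical and are not taken from MUFOLD-CL.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue positions of one model."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def matrix_rms_difference(dm_a, dm_b):
    """Root mean square difference over the upper triangles of two matrices.

    Illustrative stand-in only; this is NOT the paper's Dscore1 formula.
    """
    n = len(dm_a)
    total = 0.0
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = dm_a[i][j] - dm_b[i][j]
            total += diff * diff
            count += 1
    return math.sqrt(total / count)

# Toy 4-residue models (hypothetical coordinates, in angstroms).
model_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# model_b is model_a rotated by 90 degrees about the z axis and translated.
model_b = [(1.0, 2.0, 3.0), (1.0, 5.8, 3.0), (1.0, 9.6, 3.0), (1.0, 13.4, 3.0)]
# model_c has a different shape: the chain is bent into a square.
model_c = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

same_shape = matrix_rms_difference(distance_matrix(model_a), distance_matrix(model_b))
different_shape = matrix_rms_difference(distance_matrix(model_a), distance_matrix(model_c))
print(same_shape, different_shape)
```

Because a distance matrix is unchanged by rotation and translation, the two identically shaped models compare as essentially identical without any structural superposition, while the bent model does not. How the paper's method reaches linear time in the number of models is not described in the abstract and is not shown here.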

Similar articles

Clustering 100,000 protein structure decoys in minutes.

IEEE/ACM Trans Comput Biol Bioinform. 2012 May-Jun;9(3):765-73. doi: 10.1109/TCBB.2011.142.

A Fast Projection-Based Algorithm for Clustering Big Data.

Interdiscip Sci. 2019 Sep;11(3):360-366. doi: 10.1007/s12539-018-0294-3. Epub 2018 Jun 7.

SCUD: fast structure clustering of decoys using reference state to remove overall rotation.

J Comput Chem. 2005 Aug;26(11):1189-92. doi: 10.1002/jcc.20251.

A fast hierarchical clustering algorithm for large-scale protein sequence data sets.

Comput Biol Med. 2014 May;48:94-101. doi: 10.1016/j.compbiomed.2014.02.016. Epub 2014 Mar 4.

Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score.

BMC Bioinformatics. 2008 Dec 12;9:531. doi: 10.1186/1471-2105-9-531.

Fast large-scale clustering of protein structures using Gauss integrals.

Bioinformatics. 2012 Feb 15;28(4):510-5. doi: 10.1093/bioinformatics/btr692. Epub 2011 Dec 22.

Automated clustering of ensembles of alternative models in protein structure databases.

Protein Eng Des Sel. 2004 Jun;17(6):537-43. doi: 10.1093/protein/gzh063. Epub 2004 Aug 19.

Granular clustering of de novo protein models.

Bioinformatics. 2017 Feb 1;33(3):390-396. doi: 10.1093/bioinformatics/btw628.

Accelerated protein structure comparison using TM-score-GPU.

Bioinformatics. 2012 Aug 15;28(16):2191-2. doi: 10.1093/bioinformatics/bts345. Epub 2012 Jun 19.

Cited by

Estimation of model accuracy by a unique set of features and tree-based regressor.

Sci Rep. 2022 Aug 18;12(1):14074. doi: 10.1038/s41598-022-17097-z.

Decoy selection for protein structure prediction via extreme gradient boosting and ranking.

BMC Bioinformatics. 2020 Dec 9;21(Suppl 1):189. doi: 10.1186/s12859-020-3523-9.

QDeep: distance-based protein model quality estimation by residue-level ensemble error classifications using stacked deep residual neural networks.

Bioinformatics. 2020 Jul 1;36(Suppl_1):i285-i291. doi: 10.1093/bioinformatics/btaa455.

Ranking near-native candidate protein structures via random forest classification.

BMC Bioinformatics. 2019 Dec 24;20(Suppl 25):683. doi: 10.1186/s12859-019-3257-8.

Unsupervised and Supervised Learning over the Energy Landscape for Protein Decoy Selection.

Biomolecules. 2019 Oct 14;9(10):607. doi: 10.3390/biom9100607.

Tight clustering for large datasets with an application to gene expression data.

Sci Rep. 2019 Feb 28;9(1):3053. doi: 10.1038/s41598-019-39459-w.

Identify High-Quality Protein Structural Models by Enhanced K-Means.

Biomed Res Int. 2017;2017:7294519. doi: 10.1155/2017/7294519. Epub 2017 Mar 22.

UniCon3D: de novo protein structure prediction using united-residue conformational search via stepwise, probabilistic sampling.

Bioinformatics. 2016 Sep 15;32(18):2791-9. doi: 10.1093/bioinformatics/btw316. Epub 2016 Jun 3.

Massive integration of diverse protein quality assessment methods to improve template based modeling in CASP11.

Proteins. 2016 Sep;84 Suppl 1(Suppl 1):247-59. doi: 10.1002/prot.24924. Epub 2015 Sep 29.

Large-scale model quality assessment for improving protein tertiary structure prediction.

Bioinformatics. 2015 Jun 15;31(12):i116-23. doi: 10.1093/bioinformatics/btv235.
