uQlust: combining profile hashing with linear-time ranking for efficient clustering and analysis of big macromolecular data.

Authors

Adamczak Rafal, Meller Jarek

Affiliations

Department of Informatics, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University, Grudziadzka 5, 87-100, Torun, Poland.

Departments of Environmental Health and Electrical Engineering & Computing Systems, University of Cincinnati, Cincinnati, USA.

Publication information

BMC Bioinformatics. 2016 Dec 28;17(1):546. doi: 10.1186/s12859-016-1381-2.

DOI: 10.1186/s12859-016-1381-2

PMID: 28031034

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5198500/

Abstract

BACKGROUND

Advances in computing have enabled current protein and RNA structure prediction and molecular simulation methods to dramatically increase their sampling of conformational spaces. The rapidly growing number of experimentally resolved structures, together with databases such as the Protein Data Bank, also calls for large-scale structural similarity analyses to retrieve and classify macromolecular data. Consequently, the computational cost of structure comparison and clustering for large sets of macromolecular structures has become a bottleneck that necessitates further algorithmic improvements and the development of efficient software solutions.

RESULTS

uQlust is a versatile and easy-to-use tool for ultrafast ranking and clustering of macromolecular structures. uQlust uses structural profiles of proteins and nucleic acids, and combines a linear-time algorithm for implicit comparison of all pairs of models with profile hashing to enable efficient clustering of large data sets with a low memory footprint. In addition to ranking and clustering large sets of models of the same protein or RNA molecule, uQlust can also be used in conjunction with fragment-based profiles to cluster structures of arbitrary length. For example, hierarchical clustering of the entire PDB using profile hashing can be performed on a typical laptop, thus opening an avenue for structural explorations previously limited to dedicated resources. The uQlust package is freely available under the GNU General Public License at https://github.com/uQlust.
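
The two ideas named in the paragraph above, linear-time implicit comparison of all pairs and profile hashing, can be illustrated with a minimal Python sketch. It assumes, as a simplification that this abstract does not spell out, that every model has already been encoded as an equal-length string of discrete structural states (for example, one secondary-structure letter per residue). The function names `consensus_scores` and `hash_clusters`, the `keep_positions` parameter, and the toy profiles are hypothetical and are not taken from the uQlust package.

```python
from collections import Counter, defaultdict


def consensus_scores(profiles):
    """Mean per-position agreement of each profile with all profiles.

    Hypothetical sketch: the score of a profile equals its average
    identity to every profile in the set (itself included), but it is
    computed from per-position state counts in O(N * L) time instead of
    comparing all N^2 pairs explicitly.
    """
    if not profiles:
        return []
    n, length = len(profiles), len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("all profiles must have the same length")

    # Pass 1: count how often each state occurs at each position.
    counts = [Counter() for _ in range(length)]
    for profile in profiles:
        for i, state in enumerate(profile):
            counts[i][state] += 1

    # Pass 2: the count of a profile's own state at position i is the
    # number of profiles that agree with it there.
    return [
        sum(counts[i][state] for i, state in enumerate(profile)) / (n * length)
        for profile in profiles
    ]


def hash_clusters(profiles, keep_positions=None):
    """Group profile indices by a hash key (assumed scheme, for illustration).

    With keep_positions=None, only identical profiles share a bucket.
    Passing a subset of positions coarsens the key, so that profiles
    agreeing on those positions fall into the same bucket.
    """
    buckets = defaultdict(list)
    for index, profile in enumerate(profiles):
        if keep_positions is None:
            key = profile
        else:
            key = "".join(profile[i] for i in keep_positions)
        buckets[key].append(index)
    return dict(buckets)


if __name__ == "__main__":
    toy = ["HHEC", "HHEC", "HHCC", "EECC"]  # made-up profiles
    print(consensus_scores(toy))
    print(hash_clusters(toy))
    print(hash_clusters(toy, keep_positions=[0, 1]))
```

In this sketch, ranking costs time proportional to the number of models times the profile length, and hashing needs only one dictionary entry per distinct key, which is one plausible reading of how a low memory footprint could be achieved. The actual profile definitions and hashing scheme used by uQlust are described in the full text, not in this abstract.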

CONCLUSION

uQlust drastically reduces computational complexity and memory requirements relative to existing clustering and model quality assessment methods for macromolecular structure analysis, while yielding results on par with traditional approaches for both proteins and RNAs.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a1fa/5198500/fe20927e48fa/12859_2016_1381_Fig1_HTML.jpg
