uQlust: combining profile hashing with linear-time ranking for efficient clustering and analysis of big macromolecular data.

Authors

Adamczak Rafal, Meller Jarek

Affiliations

Department of Informatics, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University, Grudziadzka 5, 87-100, Torun, Poland.

Departments of Environmental Health and Electrical Engineering & Computing Systems, University of Cincinnati, Cincinnati, USA.

Publication information

BMC Bioinformatics. 2016 Dec 28;17(1):546. doi: 10.1186/s12859-016-1381-2.

DOI: 10.1186/s12859-016-1381-2
PMID: 28031034
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5198500/
Abstract

BACKGROUND

Advances in computing have enabled current protein and RNA structure prediction and molecular simulation methods to dramatically increase their sampling of conformational spaces. The quickly growing number of experimentally resolved structures, and databases such as the Protein Data Bank, also implies large scale structural similarity analyses to retrieve and classify macromolecular data. Consequently, the computational cost of structure comparison and clustering for large sets of macromolecular structures has become a bottleneck that necessitates further algorithmic improvements and development of efficient software solutions.

RESULTS

uQlust is a versatile and easy-to-use tool for ultrafast ranking and clustering of macromolecular structures. uQlust makes use of structural profiles of proteins and nucleic acids, while combining a linear-time algorithm for implicit comparison of all pairs of models with profile hashing to enable efficient clustering of large data sets with a low memory footprint. In addition to ranking and clustering of large sets of models of the same protein or RNA molecule, uQlust can also be used in conjunction with fragment-based profiles in order to cluster structures of arbitrary length. For example, hierarchical clustering of the entire PDB using profile hashing can be performed on a typical laptop, thus opening an avenue for structural explorations previously limited to dedicated resources. The uQlust package is freely available under the GNU General Public License at https://github.com/uQlust .
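
The idea of implicitly comparing all pairs of models in linear time, and of hashing profiles into groups, can be illustrated with a minimal Python sketch. This is not uQlust's actual implementation: the function names, the toy profile alphabet (H, E, C), the position-wise match score, and the choice of hash columns are all assumptions made for illustration. It assumes every model is encoded as a profile string of the same length.

```python
from collections import Counter, defaultdict


def consensus_scores(profiles):
    """Score each profile against the whole set in O(N * L) time.

    A profile's score is the sum, over positions, of how many profiles
    share its state at that position, normalized to the range (0, 1].
    This equals the sum of position-wise matches against every profile
    (itself included) without enumerating the pairs.
    """
    length = len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("all profiles must have the same length")
    # First pass: count the states observed in each column.
    column_counts = [Counter(column) for column in zip(*profiles)]
    total = len(profiles) * length
    # Second pass: look up the count of each profile's own state per column.
    return [
        sum(column_counts[i][state] for i, state in enumerate(profile)) / total
        for profile in profiles
    ]


def explicit_pairwise_scores(profiles):
    """Reference O(N^2 * L) version that compares every pair explicitly."""
    total = len(profiles) * len(profiles[0])
    return [
        sum(a == b for other in profiles for a, b in zip(profile, other)) / total
        for profile in profiles
    ]


def hash_buckets(profiles, columns):
    """Toy profile hashing: group profiles that agree on the chosen columns.

    The key is built from a subset of positions, so grouping takes a single
    pass over the data and no distance matrix is ever stored.
    """
    buckets = defaultdict(list)
    for index, profile in enumerate(profiles):
        key = "".join(profile[c] for c in columns)
        buckets[key].append(index)
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical profiles, e.g. per-residue secondary-structure states.
    models = ["HHHEEC", "HHHEEC", "HHCEEC", "CCCEEH"]
    print(consensus_scores(models))
    print(hash_buckets(models, columns=[0, 5]))
```

In this sketch the ranking and the explicit all-pairs comparison give the same scores, but the ranking only needs per-column state counts, which is what keeps both time and memory linear in the number of models. How uQlust selects profile columns and builds its hash keys is described in the full paper and is not reproduced here.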

CONCLUSION

uQlust represents a drastic reduction in the computational complexity and memory requirements with respect to existing clustering and model quality assessment methods for macromolecular structure analysis, while yielding results on par with traditional approaches for both proteins and RNAs.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a1fa/5198500/9068b031ad18/12859_2016_1381_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a1fa/5198500/fe20927e48fa/12859_2016_1381_Fig1_HTML.jpg

Similar articles

1
uQlust: combining profile hashing with linear-time ranking for efficient clustering and analysis of big macromolecular data.
BMC Bioinformatics. 2016 Dec 28;17(1):546. doi: 10.1186/s12859-016-1381-2.
2
Structural alignment of protein descriptors - a combinatorial model.
BMC Bioinformatics. 2016 Sep 17;17:383. doi: 10.1186/s12859-016-1237-9.
3
Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment.
Molecules. 2019 Jan 5;24(1):179. doi: 10.3390/molecules24010179.
4
A permissive secondary structure-guided superposition tool for clustering of protein fragments toward protein structure prediction via fragment assembly.
Bioinformatics. 2006 Jun 1;22(11):1343-52. doi: 10.1093/bioinformatics/btl098. Epub 2006 Mar 16.
5
Clustering 100,000 protein structure decoys in minutes.
IEEE/ACM Trans Comput Biol Bioinform. 2012 May-Jun;9(3):765-73. doi: 10.1109/TCBB.2011.142.
6
RNACluster: An integrated tool for RNA secondary structure comparison and clustering.
J Comput Chem. 2008 Jul 15;29(9):1517-26. doi: 10.1002/jcc.20911.
7
Multiple structural alignment and core detection by geometric hashing.
Proc Int Conf Intell Syst Mol Biol. 1999:169-77.
8
A Fast Projection-Based Algorithm for Clustering Big Data.
Interdiscip Sci. 2019 Sep;11(3):360-366. doi: 10.1007/s12539-018-0294-3. Epub 2018 Jun 7.
9
Fuzzy kernel clustering of RNA secondary structure ensemble using a novel similarity metric.
J Biomol Struct Dyn. 2008 Jun;25(6):685-96. doi: 10.1080/07391102.2008.10507214.
10
A fast hierarchical clustering algorithm for large-scale protein sequence data sets.
Comput Biol Med. 2014 May;48:94-101. doi: 10.1016/j.compbiomed.2014.02.016. Epub 2014 Mar 4.
