ProdMX: Rapid query and analysis of protein functional domain based on compressed sparse matrices.

Authors

Wanchai Visanu, Nookaew Intawat, Ussery David W

Affiliation

Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA.

Publication

Comput Struct Biotechnol J. 2020 Nov 24;18:3890-3896. doi: 10.1016/j.csbj.2020.10.023. eCollection 2020.

DOI:10.1016/j.csbj.2020.10.023

PMID:33335686

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7719867/

Abstract

Large-scale protein analysis has been used to characterize large numbers of proteins across numerous species. One application is high-throughput screening of genomes for pathogenicity. Unlike sequence homology methods, comparing proteins at the functional level makes it possible to classify proteins by their functional structures without dealing with the sequence complexity of distantly related species. Protein functions can be described abstractly by a set of protein functional domains, such as PfamA domains; a set of genomes can then be mapped to a matrix in which each row represents a genome and each column represents the presence or absence of a given functional domain. However, a powerful tool is needed to analyze the large sparse matrices that will be generated from the millions of genomes expected to become available in the near future. ProdMX is a tool with user-friendly utilities developed to facilitate high-throughput protein analysis, and it can be included as a module in a high-throughput pipeline. ProdMX employs a compressed sparse matrix algorithm to reduce the computational resources and time needed for matrix manipulation during functional domain analysis. ProdMX is a free and publicly available Python package that can be installed from PyPI or with Conda, or with a standard installer from the source code available in the ProdMX GitHub repository at https://github.com/visanuwan/prodmx.
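
The genome-by-domain matrix described in the abstract can be illustrated with a minimal sketch. It does not use or reproduce the ProdMX API; it relies only on SciPy's `csr_matrix`. The genome names, the domain accessions, and the two queries are hypothetical and chosen purely for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical input: each genome mapped to the set of Pfam-A domain
# accessions detected in it (illustrative values, not real results).
genome_domains = {
    "genome_A": {"PF00001", "PF00005"},
    "genome_B": {"PF00005"},
    "genome_C": {"PF00001", "PF00005", "PF00072"},
}

genomes = sorted(genome_domains)                         # row labels
domains = sorted(set().union(*genome_domains.values()))  # column labels
col_index = {d: j for j, d in enumerate(domains)}

# Collect (row, column) coordinates of every "domain present" entry.
rows, cols = [], []
for i, genome in enumerate(genomes):
    for domain in genome_domains[genome]:
        rows.append(i)
        cols.append(col_index[domain])

# Only the non-zero entries are stored; absent domains cost no memory.
data = np.ones(len(rows), dtype=np.uint8)
matrix = csr_matrix((data, (rows, cols)), shape=(len(genomes), len(domains)))

# Query 1: domains present in every genome (column sums equal the row count).
counts = np.asarray(matrix.sum(axis=0)).ravel()
core = [d for d, n in zip(domains, counts) if n == len(genomes)]

# Query 2: domains found in genome_C but not in genome_A, read from the
# column indices that CSR stores for each row.
row_a = set(matrix[genomes.index("genome_A")].indices)
row_c = set(matrix[genomes.index("genome_C")].indices)
only_in_c = sorted(domains[j] for j in row_c - row_a)

print(core)       # domains shared by all genomes
print(only_in_c)  # domains specific to genome_C relative to genome_A
```

Because the compressed sparse row layout keeps only the coordinates of present domains, memory use grows with the number of observed genome-domain pairs rather than with the full number of genomes times domains, which is the property the abstract appeals to for very large genome collections.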

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/207a/7719867/200903ab0fa8/gr1.jpg
