Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey.

Authors

Tariq Muhammad Usman, Haseeb Muhammad, Aledhari Mohammed, Razzak Rehma, Parizi Reza M, Saeed Fahad

Affiliations

School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA.

College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA.

Publication information

IEEE Access. 2021;9:5497-5516. doi: 10.1109/ACCESS.2020.3047588. Epub 2020 Dec 25.

Abstract

Big Data Proteogenomics lies at the intersection of high-throughput mass spectrometry (MS) based proteomics and next-generation sequencing based genomics. The integrated analysis of data from these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Because of the biological significance of integrated analysis, recent years have seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to genomic data, searching experimental MS spectra against a six-frame translated genome database, and automating the annotation of genome sequences. To date, most such tools have not addressed the scalability issues inherent in proteogenomic data analysis, where the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a 3 GB genome. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, with a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and offer recommendations that can be incorporated into the future design of these workflows to ensure scalability as proteogenomic datasets grow. Lastly, we make a case for how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic analysis.
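To illustrate why a six-frame translated genome database is so much larger than a typical protein database, the following minimal Python sketch translates a DNA sequence in all six reading frames (three offsets on each strand). It is not taken from the article or from any of the reviewed tools; the function names (`six_frame_translation`, `translate`, `reverse_complement`) are hypothetical, and the sketch assumes the standard genetic code and an uppercase A/C/G/T alphabet.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translate a DNA sequence in the three forward and three reverse frames."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Every nucleotide position contributes to six candidate protein sequences, so the search space grows with the full genome length rather than with the annotated coding regions alone, which is consistent with the scalability concern raised in the abstract.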

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/658f/7853650/6d646f41f238/nihms-1662073-f0007.jpg