Computational graph pangenomics: a tutorial on data structures and their applications.

Author information

Baaijens Jasmijn A, Bonizzoni Paola, Boucher Christina, Della Vedova Gianluca, Pirola Yuri, Rizzi Raffaella, Sirén Jouni

Affiliations

Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands.

Department of Biomedical Informatics, Harvard University, 10 Shattuck St, Boston, MA 02115, USA.

Publication information

Nat Comput. 2022 Mar;21(1):81-108. doi: 10.1007/s11047-022-09882-6. Epub 2022 Mar 4.

Abstract

Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes as well as the specificity of an individual genome in a personalized approach to medicine is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
