Li Heng, Marin Maximillian, Farhat Maha Reda
Dana-Farber Cancer Institute, Boston, MA 02215, USA.
Harvard Medical School, Boston, MA 02215, USA.
Bioinformatics. 2024 Jul 23;40(7). doi: 10.1093/bioinformatics/btae456.
The gene content regulates the biology of an organism. It varies between species and between individuals of the same species. Although tools have been developed to identify gene content changes in bacterial genomes, none is applicable to collections of large eukaryotic genomes such as the human pangenome.
We developed pangene, a computational tool to identify gene orientation, gene order and gene copy-number changes in a collection of genomes. Pangene aligns a set of input protein sequences to the genomes, resolves redundancies between protein sequences and constructs a gene graph in which each genome is represented as a walk. It additionally finds subgraphs, which we call bibubbles, that capture gene content changes. Applied to the human pangenome, pangene identifies known gene-level variations and reveals complex haplotypes that have not been well studied before. Pangene also works with high-quality bacterial pangenomes and reports numbers of core and accessory genes similar to those from existing tools.
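The idea of a gene graph with genomes as walks can be illustrated with a small sketch. This is not pangene's implementation: the input walks, the gene names `g1` to `g4`, the haplotype names and all function names below are hypothetical, and a gene is treated as "core" simply when it occurs in every toy genome. Each genome is a list of oriented genes, and an arc is stored in a canonical orientation so that traversing the same adjacency on the opposite strand maps to the same arc. Bibubble finding is not attempted here.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: genome name -> walk of (gene, strand) pairs.
walks = {
    "hapA": [("g1", "+"), ("g2", "+"), ("g3", "+"), ("g4", "+")],
    "hapB": [("g1", "+"), ("g3", "-"), ("g2", "-"), ("g4", "+")],  # g2-g3 inverted
    "hapC": [("g1", "+"), ("g2", "+"), ("g2", "+"), ("g4", "+")],  # g2 duplicated, g3 absent
}


def flip(node):
    """Return the same gene on the opposite strand."""
    gene, strand = node
    return (gene, "-" if strand == "+" else "+")


def canonical_arc(a, b):
    """An arc a->b and its reverse complement flip(b)->flip(a) are the same adjacency."""
    return min((a, b), (flip(b), flip(a)))


def build_gene_graph(walks):
    """Map each canonical arc to the set of genomes whose walk traverses it."""
    arcs = defaultdict(set)
    for name, walk in walks.items():
        for a, b in zip(walk, walk[1:]):
            arcs[canonical_arc(a, b)].add(name)
    return dict(arcs)


def copy_numbers(walks):
    """Count how many times each gene appears in each genome, ignoring strand."""
    return {name: Counter(gene for gene, _ in walk) for name, walk in walks.items()}


def core_and_accessory(walks):
    """Core genes occur in every genome; accessory genes occur in only some."""
    gene_sets = [{gene for gene, _ in walk} for walk in walks.values()]
    core = set.intersection(*gene_sets)
    return core, set.union(*gene_sets) - core


arcs = build_gene_graph(walks)
counts = copy_numbers(walks)
core, accessory = core_and_accessory(walks)
```

In this toy input, the arcs private to `hapB` mark the breakpoints of the inversion, while the per-genome counts expose the duplication of `g2` in `hapC`.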
Source code is available at https://github.com/lh3/pangene; pre-built pangene graphs can be downloaded from https://zenodo.org/records/8118576 and visualized at https://pangene.bioinweb.org.