Wittouck Stijn, Eilers Tom, van Noort Vera, Lebeer Sarah
Lab of Applied Microbiology and Biotechnology, Department of Bioscience Engineering, University of Antwerp, Antwerpen 2020, Belgium.
Faculty of Bioscience Engineering, KU Leuven, Leuven 3001, Belgium.
Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae735.
Much of prokaryotic comparative genomics currently relies on two critical computational tasks: pangenome inference and core genome inference. Pangenome inference involves clustering genes from a set of genomes into gene families, enabling genome-wide association studies and evolutionary history analysis. The core genome represents gene families present in nearly all genomes and is required to infer a high-quality phylogeny. For species-level datasets, fast pangenome inference tools have been developed. However, tools applicable to more diverse datasets are currently slow and scale poorly.
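To make the relation between the two tasks concrete, the following minimal sketch derives a core genome from an already inferred pangenome by keeping the gene families present in at least a given fraction of genomes. It is an illustration only, not SCARAP's code: the function name `core_families`, the `(gene, genome, family)` tuple layout, and the 0.95 default threshold are all assumptions made for this example.

```python
from collections import defaultdict


def core_families(gene_table, min_prevalence=0.95):
    """Return the gene families present in at least `min_prevalence` of genomes.

    `gene_table` is an iterable of (gene_id, genome_id, family_id) tuples,
    i.e. a pangenome in long format (hypothetical layout for this sketch).
    """
    genomes = set()
    family_genomes = defaultdict(set)
    for _gene, genome, family in gene_table:
        genomes.add(genome)
        # A family counts once per genome, however many copies it has there.
        family_genomes[family].add(genome)
    n_genomes = len(genomes)
    return sorted(
        family
        for family, present_in in family_genomes.items()
        if len(present_in) / n_genomes >= min_prevalence
    )


# Toy pangenome: three genomes, three gene families.
table = [
    ("g1_1", "genome1", "famA"), ("g1_2", "genome1", "famB"),
    ("g1_3", "genome1", "famC"),
    ("g2_1", "genome2", "famA"), ("g2_2", "genome2", "famB"),
    ("g3_1", "genome3", "famA"),
]
strict_core = core_families(table)                     # only famA is in all genomes
relaxed_core = core_families(table, min_prevalence=0.6)
```

Extracting the core this way requires the full pangenome first, which is the expensive step for diverse datasets.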
Here, we introduce SCARAP, a program containing three modules for comparative genomics analyses: a fast and scalable pangenome inference module, a direct core genome inference module, and a module for subsampling representative genomes. When benchmarked against existing tools, the SCARAP pan module proved up to an order of magnitude faster with comparable accuracy. The core module was validated by comparing its result against a core genome extracted from a full pangenome. The sample module demonstrated the rapid sampling of genomes with decreasing novelty. Applied to a dataset of over 31 000 Lactobacillales genomes, SCARAP showcased its ability to derive a representative pangenome. Finally, we applied the novel concept of gene fixation frequency to this pangenome, showing that Lactobacillales genes that are prevalent but rarely fixate in species often encode bacteriophage functions.
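The abstract does not define gene fixation frequency. As a rough sketch under a stated assumption, it can be read as the fraction of species carrying a gene family in which that family is "fixed", meaning present in at least a threshold fraction of the species' genomes. The function name, the input layout, and the 0.9 threshold below are hypothetical and may differ from the definition used in the paper.

```python
from collections import defaultdict


def fixation_frequency(presence, species_of, fixed_threshold=0.9):
    """Assumed definition: share of carrying species in which a family is fixed.

    `presence` maps family_id -> set of genome_ids that contain the family.
    `species_of` maps genome_id -> species name.
    """
    species_sizes = defaultdict(int)
    for species in species_of.values():
        species_sizes[species] += 1

    result = {}
    for family, genomes in presence.items():
        # Count, per species, how many genomes carry this family.
        carriers = defaultdict(int)
        for genome in genomes:
            carriers[species_of[genome]] += 1
        # A family is "fixed" in a species if nearly all its genomes carry it.
        n_fixed = sum(
            1
            for species, count in carriers.items()
            if count / species_sizes[species] >= fixed_threshold
        )
        result[family] = n_fixed / len(carriers) if carriers else 0.0
    return result


# Toy data: two species with two genomes each.
species_of = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
presence = {
    "housekeeping": {"g1", "g2", "g3", "g4"},  # fixed in both species
    "mixed": {"g1", "g2", "g3"},               # fixed in A only
    "phage_like": {"g1", "g3"},                # prevalent, fixed in neither
}
freqs = fixation_frequency(presence, species_of)
```

Under this reading, a family such as `phage_like` occurs in many species yet scores low, which is the pattern the abstract reports for genes encoding bacteriophage functions.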
The SCARAP toolkit is publicly available at https://github.com/swittouck/scarap.