Coudert Rémi-Vinh, Charrier Jean-Philippe, Jauffrit Frédéric, Flandrois Jean-Pierre, Brochier-Armanet Céline
Université Claude Bernard Lyon 1, LBBE, UMR 5558, CNRS, VAS, 69622, Villeurbanne, France.
Microbiology Research and Development, BioMérieux SA, 376 Chemin de L'Orme, 69280, Marcy-L'Étoile, France.
BMC Bioinformatics. 2025 May 6;26(1):121. doi: 10.1186/s12859-025-06095-3.
Genome sequence databases are growing exponentially, but with high redundancy and uneven data quality. For these reasons, selecting representative subsets of genomes is an essential step for almost all studies. However, most current sampling approaches are biased and unable to process large datasets in a reasonable time.
Here we present MPS-Sampling (Multiple-Protein Similarity-based Sampling), a fast, scalable, and efficient method for selecting reliable and representative samples of genomes from very large datasets. Using families of homologous proteins as input, MPS-Sampling delineates homogeneous groups of genomes through two successive clustering steps. Representative genomes are then selected within these groups according to predefined or user-defined priority criteria.
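The workflow described above (cluster within each protein family, group genomes by their cluster profiles, then pick one representative per group) can be illustrated with a minimal sketch. This is a toy stand-in under stated assumptions, not the authors' implementation: the greedy clustering, the position-wise `identity` measure, the `threshold` and `min_shared` parameters, the sequences, and the `quality` scores are all hypothetical and chosen only to make the example self-contained.

```python
from collections import defaultdict

def identity(a, b):
    """Fraction of matching positions between two toy sequences (assumed measure)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def cluster_family(seqs, threshold):
    """Step 1 (toy): greedily cluster the sequences of one protein family.
    seqs maps genome -> sequence; returns genome -> cluster id."""
    reps, labels = [], {}
    for genome, seq in sorted(seqs.items()):
        for cid, rep in enumerate(reps):
            if identity(seq, rep) >= threshold:
                labels[genome] = cid
                break
        else:  # no existing cluster is similar enough: open a new one
            reps.append(seq)
            labels[genome] = len(reps) - 1
    return labels

def group_genomes(families, threshold, min_shared):
    """Step 2 (toy): group genomes whose per-family cluster ids agree with a
    group's seed genome in at least `min_shared` (fraction) of the families."""
    profiles = defaultdict(dict)
    for fam, seqs in families.items():
        for genome, cid in cluster_family(seqs, threshold).items():
            profiles[genome][fam] = cid
    groups = []  # each group is a list of genomes; its first member is the seed
    for genome in sorted(profiles):
        for group in groups:
            seed = profiles[group[0]]
            shared = sum(1 for fam, cid in profiles[genome].items()
                         if seed.get(fam) == cid)
            if shared / len(families) >= min_shared:
                group.append(genome)
                break
        else:
            groups.append([genome])
    return groups

def pick_representatives(groups, priority):
    """Select one genome per group; the lowest priority key wins."""
    return [min(group, key=priority) for group in groups]

# Hypothetical input: two protein families across four genomes.
families = {
    "uL2": {"g1": "MKVAA", "g2": "MKVAA", "g3": "MRLCC", "g4": "MRLCC"},
    "uS3": {"g1": "TTGHK", "g2": "TTGHR", "g3": "PPWQE", "g4": "PPWQE"},
}
quality = {"g1": 0.90, "g2": 0.99, "g3": 0.95, "g4": 0.80}  # assumed criterion

groups = group_genomes(families, threshold=0.75, min_shared=1.0)
representatives = pick_representatives(groups, lambda g: (-quality[g], g))
print(groups, representatives)
```

In this sketch the priority criterion is a single quality score, with the genome name as a tie-breaker. A user-defined criterion would be swapped in by passing a different key function to `pick_representatives`.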
MPS-Sampling was applied to a dataset of 48 ribosomal protein families from 178,203 bacterial genomes to generate representative genome sets of various sizes, ranging from 32.17% down to 0.3% of the complete dataset. An in-depth analysis shows that the selected genomes are both taxonomically and phylogenetically representative of the complete dataset, demonstrating the relevance of the approach.
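For a sense of scale, the reported sampling rates can be converted into approximate genome counts. The short calculation below is only a back-of-the-envelope sketch: the helper name `implied_sample_size` is hypothetical, and because the published percentages are rounded, the results are order-of-magnitude estimates rather than the exact set sizes.

```python
TOTAL_GENOMES = 178_203  # dataset size stated in the abstract

def implied_sample_size(rate_percent, total=TOTAL_GENOMES):
    """Approximate number of genomes implied by a sampling rate in percent."""
    return round(total * rate_percent / 100)

largest = implied_sample_size(32.17)  # densest sampling reported
smallest = implied_sample_size(0.3)   # sparsest sampling reported
print(largest, smallest)
```

The two extremes therefore correspond to roughly 57,000 genomes and roughly 500 genomes.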
MPS-Sampling provides an efficient, fast, and scalable way to sample large collections of genomes. It does not rely on taxonomic information and does not require the inference of phylogenetic trees, thus avoiding the biases inherent in these approaches. As such, MPS-Sampling meets the needs of a growing number of users.