Multi-proteins similarity-based sampling to select representative genomes from large databases.

Authors

Coudert Rémi-Vinh, Charrier Jean-Philippe, Jauffrit Frédéric, Flandrois Jean-Pierre, Brochier-Armanet Céline

Affiliations

Université Claude Bernard Lyon 1, LBBE, UMR 5558, CNRS, VAS, 69622, Villeurbanne, France.

Microbiology Research and Development, BioMérieux SA, 376 Chemin de L'Orme, 69280, Marcy-L'Étoile, France.

Publication information

BMC Bioinformatics. 2025 May 6;26(1):121. doi: 10.1186/s12859-025-06095-3.

DOI:10.1186/s12859-025-06095-3
PMID:40329187
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12057276/
Abstract

BACKGROUND

Genome sequence databases are growing exponentially, but with high redundancy and uneven data quality. For these reasons, selecting representative subsets of genomes is an essential step for almost all studies. However, most current sampling approaches are biased and unable to process large datasets in a reasonable time.

METHODS

Here we present MPS-Sampling (Multiple-Protein Similarity-based Sampling), a fast, scalable, and efficient method for selecting reliable and representative samples of genomes from very large datasets. Using families of homologous proteins as input, MPS-Sampling delineates homogeneous groups of genomes through two successive clustering steps. Representative genomes are then selected within these groups according to predefined or user-defined priority criteria.
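The grouping and selection logic described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithms, similarity measure, or priority criteria, so the greedy grouping rule, the `threshold` parameter, the `profiles` input (one per-family cluster label per genome, assumed to come from a first clustering step), and the `priority` scores are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input: for each genome, the cluster label assigned to its protein
# in each family by a first, per-family clustering step (not shown here).
Profile = Tuple[str, ...]


def shared_fraction(a: Profile, b: Profile) -> float:
    """Fraction of protein families in which two genomes carry the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def group_genomes(profiles: Dict[str, Profile], threshold: float) -> List[List[str]]:
    """Greedy second-step grouping (an assumption, not the published method):
    a genome joins the first group whose seed genome shares at least
    `threshold` of its labels; otherwise it seeds a new group."""
    groups: List[List[str]] = []
    for genome in sorted(profiles):  # sorted for deterministic output
        for group in groups:
            if shared_fraction(profiles[genome], profiles[group[0]]) >= threshold:
                group.append(genome)
                break
        else:
            groups.append([genome])
    return groups


def pick_representatives(groups: List[List[str]], priority: Dict[str, int]) -> List[str]:
    """Select one genome per group: lowest priority score wins, ties broken by name."""
    return [min(group, key=lambda g: (priority[g], g)) for group in groups]


# Toy data: four genomes described over three protein families.
profiles = {
    "g1": ("A1", "B1", "C1"),
    "g2": ("A1", "B1", "C2"),
    "g3": ("A2", "B2", "C3"),
    "g4": ("A2", "B2", "C3"),
}
priority = {"g1": 2, "g2": 1, "g3": 1, "g4": 1}  # hypothetical quality ranking

groups = group_genomes(profiles, threshold=0.6)
representatives = pick_representatives(groups, priority)
print(groups)           # [['g1', 'g2'], ['g3', 'g4']]
print(representatives)  # ['g2', 'g3']
```

Raising `threshold` in this sketch yields more, smaller groups and therefore a larger sample, which gives one plausible way a single parameter could control sampling density.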

RESULTS

MPS-Sampling was applied to a dataset of 48 ribosomal protein families from 178,203 bacterial genomes to generate representative genome sets of various sizes, corresponding to a sampling of 32.17% down to 0.3% of the complete dataset. An in-depth analysis shows that the selected genomes are both taxonomically and phylogenetically representative of the complete dataset, demonstrating the relevance of the approach.
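To give a sense of scale, the reported sampling percentages can be converted into approximate genome counts. The figures below are derived only from the rounded percentages in the abstract, so the exact set sizes reported in the paper may differ slightly; the helper name `approx_sample_size` is introduced here for illustration.

```python
TOTAL_GENOMES = 178_203  # bacterial genomes in the complete dataset


def approx_sample_size(total: int, percent: float) -> int:
    """Approximate number of genomes retained at a given sampling percentage."""
    return round(total * percent / 100)


largest = approx_sample_size(TOTAL_GENOMES, 32.17)
smallest = approx_sample_size(TOTAL_GENOMES, 0.3)
print(largest)   # 57328
print(smallest)  # 535
```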

CONCLUSION

MPS-Sampling provides an efficient, fast and scalable way to sample large collections of genomes in an acceptable computational time. MPS-Sampling does not rely on taxonomic information and does not require the inference of phylogenetic trees, thus avoiding the biases inherent in these approaches. As such, MPS-Sampling meets the needs of a growing number of users.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ce81/12057276/2c0507427ee8/12859_2025_6095_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ce81/12057276/b264f6508b4e/12859_2025_6095_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ce81/12057276/5cb1465a9592/12859_2025_6095_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ce81/12057276/03cce7916451/12859_2025_6095_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ce81/12057276/41e5866cb516/12859_2025_6095_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ce81/12057276/16a3e1bef48a/12859_2025_6095_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ce81/12057276/dbd8a8bf8b5c/12859_2025_6095_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ce81/12057276/ea6ccab0662e/12859_2025_6095_Fig8_HTML.jpg

