KmerGO: A Tool to Identify Group-Specific Sequences With k-mers.

Authors

Wang Ying, Chen Qi, Deng Chao, Zheng Yiluan, Sun Fengzhu

Affiliations

Department of Automation, Xiamen University, Xiamen, China.

Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision-Making, Xiamen, China.

Publication information

Front Microbiol. 2020 Aug 25;11:2067. doi: 10.3389/fmicb.2020.02067. eCollection 2020.

DOI:10.3389/fmicb.2020.02067

PMID:32983048

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7477287/

Abstract

Capturing group-specific sequences between two groups of genomic/metagenomic sequences is critical for the follow-up identification of single nucleotide variants (SNVs), gene families, microbial species, or other elements associated with each group. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered a "group-specific" sequence in our study. We developed a user-friendly tool, KmerGO, to identify group-specific sequences between two groups of genomic/metagenomic long sequences or high-throughput sequencing datasets. Compared with other tools, KmerGO captures group-specific k-mers (k up to 40 bp) with much lower requirements for computing resources and in much shorter running time. For a 1.05 TB dataset (.fasta), it takes KmerGO about 21.5 h (including k-mer counting) to return assembled group-specific sequences on a regular stand-alone workstation with no more than 1 GB memory. Furthermore, KmerGO can also be applied to capture trait-associated sequences for continuous traits. Through multi-process parallel computing, KmerGO is implemented with both a graphical user interface and a command line on Linux and Windows, free from any pre-installed supporting environments, packages, and complex configurations. The group-specific k-mers or sequences output by KmerGO can serve as inputs to other tools for the downstream discovery of biomarkers, such as genetic variants, species, or genes. KmerGO is available at https://github.com/ChnMasterOG/KmerGO.
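
The definition of a "group-specific" sequence given above (present or rich in one group, absent or scarce in the other) can be illustrated with a small sketch. This is not KmerGO's implementation: the function names, the presence-fraction criterion, and the `min_present` / `max_absent` thresholds are all assumptions chosen for illustration, and the sketch ignores reverse complements, abundance, and statistical testing.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in one sequence (hypothetical helper)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def presence_fraction(samples, k):
    """Map each k-mer to the fraction of samples containing it at least once."""
    seen = Counter()
    for seq in samples:
        # set(...) keeps only presence/absence, not abundance
        seen.update(set(count_kmers(seq, k)))
    return {kmer: n / len(samples) for kmer, n in seen.items()}


def group_specific_kmers(group_a, group_b, k, min_present=0.8, max_absent=0.2):
    """Return (A-specific, B-specific) k-mers under assumed toy thresholds.

    A k-mer is called specific to a group if it occurs in at least
    `min_present` of that group's samples and at most `max_absent`
    of the other group's samples. Both thresholds are illustrative.
    """
    frac_a = presence_fraction(group_a, k)
    frac_b = presence_fraction(group_b, k)
    specific_a = sorted(
        m for m, f in frac_a.items()
        if f >= min_present and frac_b.get(m, 0.0) <= max_absent
    )
    specific_b = sorted(
        m for m, f in frac_b.items()
        if f >= min_present and frac_a.get(m, 0.0) <= max_absent
    )
    return specific_a, specific_b


if __name__ == "__main__":
    # Toy sequences standing in for two groups of samples
    group_a = ["ACGTACGT", "ACGTAC"]
    group_b = ["TTTTGG", "TTTGGA"]
    print(group_specific_kmers(group_a, group_b, k=4))
```

A sketch like this holds every k-mer in memory, which would not scale to the terabyte-sized datasets mentioned in the abstract; it is meant only to make the presence/absence criterion concrete.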


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e9f7/7477287/75bd0f5ea971/fmicb-11-02067-g001.jpg



