kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity.

Authors

Murray Kevin D, Webers Christfried, Ong Cheng Soon, Borevitz Justin, Warthmann Norman

Affiliations

Research School of Biology, The Australian National University, Canberra, Australia.

Data61, CSIRO, Canberra, Australia.

Publication information

PLoS Comput Biol. 2017 Sep 5;13(9):e1005727. doi: 10.1371/journal.pcbi.1005727. eCollection 2017 Sep.

DOI:10.1371/journal.pcbi.1005727

PMID:28873405

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5600398/

Abstract

Modern genomics techniques generate overwhelming quantities of data. Extracting population genetic variation demands computationally efficient methods to determine genetic relatedness between individuals (or "samples") in an unbiased manner, preferably de novo. Rapid estimation of genetic relatedness directly from sequencing data has the potential to overcome reference genome bias, and to verify that individuals belong to the correct genetic lineage before conclusions are drawn from mislabelled or misidentified samples. We present the k-mer Weighted Inner Product (kWIP), an assembly- and alignment-free estimator of genetic similarity. kWIP combines a probabilistic data structure with a novel metric, the weighted inner product (WIP), to efficiently calculate pairwise similarity between sequencing runs from their k-mer counts. It produces a distance matrix, which can then be further analysed and visualised. Our method does not require prior knowledge of the underlying genomes, and its applications include establishing sample identity and detecting sample mix-ups, non-obvious genomic variation, and population structure. We show that kWIP can reconstruct the true relatedness between samples from simulated populations. By re-analysing several published datasets, we show that our results are consistent with marker-based analyses. kWIP is written in C++, licensed under the GNU GPL, and is available from https://github.com/kdmurray91/kwip.
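
As a rough illustration of the pipeline the abstract describes (k-mer counts, then a weighted inner product, then a distance matrix), the Python sketch below may help. It is not the kWIP implementation, and several details are assumptions that the abstract does not state: a single CRC32-hashed count table stands in for the probabilistic data structure, each bin is weighted by the Shannon entropy of its presence frequency across samples, and the kernel is turned into a distance by cosine-style normalisation. The function names `kmer_counts`, `entropy_weights`, and `wip_distance_matrix` are hypothetical, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
import math
import zlib


def kmer_counts(reads, k, table_size):
    """Hash every k-mer of every read into a fixed-size count vector.

    Assumption: one CRC32-indexed table stands in for the probabilistic
    data structure; k-mers that collide simply share a bin.
    """
    counts = [0] * table_size
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[zlib.crc32(kmer.encode("ascii")) % table_size] += 1
    return counts


def entropy_weights(count_vectors):
    """Assumed weighting: Shannon entropy of each bin's presence frequency.

    Bins seen in no sample or in every sample carry no information about
    differences between samples, so they get weight 0.
    """
    n = len(count_vectors)
    weights = []
    for bin_counts in zip(*count_vectors):
        p = sum(1 for c in bin_counts if c > 0) / n
        if p == 0.0 or p == 1.0:
            weights.append(0.0)
        else:
            weights.append(-(p * math.log2(p) + (1 - p) * math.log2(1 - p)))
    return weights


def wip_distance_matrix(count_vectors):
    """Pairwise distances derived from a weighted inner product kernel."""
    weights = entropy_weights(count_vectors)
    n = len(count_vectors)
    # kernel[i][j] = sum over bins of weight * count_i * count_j
    kernel = [
        [
            sum(w * a * b for w, a, b in zip(weights, count_vectors[i], count_vectors[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a sample is at distance 0 from itself
            norm = math.sqrt(kernel[i][i] * kernel[j][j])
            if norm == 0.0:
                dist[i][j] = math.nan  # undefined: a sample has no weighted k-mers
            else:
                similarity = kernel[i][j] / norm
                # Clamp guards against tiny negative values from rounding.
                dist[i][j] = math.sqrt(max(0.0, 2.0 - 2.0 * similarity))
    return dist


if __name__ == "__main__":
    # Toy "sequencing runs": two identical samples and one unrelated sample.
    samples = [["ACGTACGTAC"], ["ACGTACGTAC"], ["TTTTGGGGCC"]]
    vectors = [kmer_counts(reads, k=4, table_size=1024) for reads in samples]
    for row in wip_distance_matrix(vectors):
        print(row)
```

In this toy setup the two identical samples should come out at (near) zero distance from each other and far from the third. A real analysis would use much longer k-mers, far larger tables, and an actual k-mer counting library rather than pure Python loops.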
