Murray Kevin D, Webers Christfried, Ong Cheng Soon, Borevitz Justin, Warthmann Norman
Research School of Biology, The Australian National University, Canberra, Australia.
Data61, CSIRO, Canberra, Australia.
PLoS Comput Biol. 2017 Sep 5;13(9):e1005727. doi: 10.1371/journal.pcbi.1005727. eCollection 2017 Sep.
Modern genomics techniques generate overwhelming quantities of data. Extracting population genetic variation demands computationally efficient methods to determine genetic relatedness between individuals (or "samples") in an unbiased manner, preferably de novo. Rapid estimation of genetic relatedness directly from sequencing data has the potential to overcome reference genome bias, and to verify that individuals belong to the correct genetic lineage before conclusions are drawn from mislabelled or misidentified samples. We present the k-mer Weighted Inner Product (kWIP), an assembly- and alignment-free estimator of genetic similarity. kWIP combines a probabilistic data structure with a novel metric, the weighted inner product (WIP), to efficiently calculate pairwise similarity between sequencing runs from their k-mer counts. It produces a distance matrix, which can then be further analysed and visualised. Our method does not require prior knowledge of the underlying genomes, and its applications include establishing sample identity and detecting sample mix-ups, non-obvious genomic variation, and population structure. We show that kWIP can reconstruct the true relatedness between samples from simulated populations. By re-analysing several published datasets, we show that our results are consistent with marker-based analyses. kWIP is written in C++, licensed under the GNU GPL, and is available from https://github.com/kdmurray91/kwip.
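The general idea of a weighted inner product over k-mer counts can be illustrated with a minimal Python sketch. It is not the kWIP implementation: it uses exact counts in a `Counter` rather than a probabilistic data structure, and because the abstract does not define the weighting, it assumes each k-mer is weighted by the Shannon entropy of the fraction of samples that contain it. The function names, the normalisation, and the conversion from similarity to distance are likewise illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def kmer_counts(reads, k):
    """Count every k-mer in a list of read strings (exact counts, no sketch)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def entropy_weights(samples):
    """Assumed weighting: Shannon entropy of each k-mer's presence frequency.

    k-mers present in every sample get weight 0, since they cannot
    distinguish one sample from another.
    """
    n = len(samples)
    presence = Counter()
    for counts in samples.values():
        presence.update(counts.keys())  # +1 per sample containing the k-mer
    weights = {}
    for kmer, seen in presence.items():
        p = seen / n
        if p >= 1.0:
            weights[kmer] = 0.0
        else:
            weights[kmer] = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return weights


def weighted_inner_product(a, b, weights):
    """Sum of count_a * count_b * weight over the k-mers shared by a and b."""
    shared = a.keys() & b.keys()
    return sum(a[m] * b[m] * weights[m] for m in shared)


def distance_matrix(samples, weights):
    """Normalise similarities to [0, 1] and convert them to distances."""
    names = sorted(samples)
    self_sim = {
        n: weighted_inner_product(samples[n], samples[n], weights) for n in names
    }
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        denom = math.sqrt(self_sim[x] * self_sim[y])
        raw = weighted_inner_product(samples[x], samples[y], weights)
        sim = raw / denom if denom > 0 else 0.0
        # Distance induced by a normalised kernel: sqrt(K_xx + K_yy - 2 K_xy)
        d = math.sqrt(max(0.0, 2.0 - 2.0 * sim))
        dist[(x, y)] = dist[(y, x)] = d
    return names, dist


if __name__ == "__main__":
    # Toy reads, invented purely for illustration.
    reads = {"s1": ["ACGTACGT"], "s2": ["ACGTACGT"], "s3": ["TTTTGGGG"]}
    samples = {name: kmer_counts(r, 4) for name, r in reads.items()}
    names, dist = distance_matrix(samples, entropy_weights(samples))
    for x in names:
        print(x, [round(dist[(x, y)], 3) for y in names])
```

In this sketch, down-weighting k-mers that occur in all samples keeps shared, uninformative sequence from dominating the similarity. The resulting distance matrix could be passed to any standard clustering or ordination tool.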