Fast genotyping of known SNPs through approximate k-mer matching.

Author information

Shajii Ariya, Yorukoglu Deniz, Yu Yun William, Berger Bonnie

Affiliations

Department of Electrical & Computer Engineering, Boston University, Boston, MA 02215, USA.

Computer Science and AI Lab.

Publication information

Bioinformatics. 2016 Sep 1;32(17):i538-i544. doi: 10.1093/bioinformatics/btw460.

Abstract

MOTIVATION

As the volume of next-generation sequencing (NGS) data increases, faster algorithms become necessary. Although speeding up individual components of a sequence analysis pipeline (e.g. read mapping) can reduce the computational cost of analysis, such approaches do not take full advantage of the particulars of a given problem. One problem of great interest, genotyping a known set of variants (e.g. dbSNP or Affymetrix SNPs), is important for characterization of known genetic traits and causative disease variants within an individual, as well as the initial stage of many ancestral and population genomic pipelines (e.g. GWAS).

RESULTS

We introduce lightweight assignment of variant alleles (LAVA), an NGS-based genotyping algorithm for a given set of SNP loci, which takes advantage of the fact that approximate matching of mid-size k-mers (with k = 32) can typically uniquely identify loci in the human genome without full read alignment. LAVA accurately calls the vast majority of SNPs in dbSNP and Affymetrix's Genome-Wide Human SNP Array 6.0 up to about an order of magnitude faster than standard NGS genotyping pipelines. For Affymetrix SNPs, LAVA has significantly higher SNP calling accuracy than existing pipelines while using as low as ∼5 GB of RAM. As such, LAVA represents a scalable computational method for population-level genotyping studies as well as a flexible NGS-based replacement for SNP arrays.
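The core idea, that k-mers overlapping a known SNP can be matched approximately against read k-mers without aligning whole reads, can be illustrated with a toy sketch. This is not LAVA's implementation: the function names, the use of Hamming distance 1 as the notion of "approximate", the fixed allele-fraction thresholds and the tiny k = 8 (instead of k = 32) are all assumptions chosen to keep the example small. It also ignores k-mers from the rest of the genome, which a real tool would have to account for.

```python
from collections import Counter

BASES = "ACGT"

def kmers(seq, k):
    """Return all k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(reference, snps, k):
    """Map every k-mer that overlaps a SNP to (position, allele).

    snps is {0-based position: alternate base}. Hypothetical helper.
    """
    index = {}
    for pos, alt in snps.items():
        lo, hi = max(0, pos - k + 1), min(len(reference), pos + k)
        for allele, base in (("ref", reference[pos]), ("alt", alt)):
            # Window in which every k-mer covers the SNP position.
            window = reference[lo:pos] + base + reference[pos + 1:hi]
            for kmer in kmers(window, k):
                index[kmer] = (pos, allele)
    return index

def lookup(kmer, index):
    """Exact match first, then a unique Hamming-distance-1 match."""
    if kmer in index:
        return index[kmer]
    hits = set()
    for i, original in enumerate(kmer):
        for b in BASES:
            if b != original:
                hit = index.get(kmer[:i] + b + kmer[i + 1:])
                if hit is not None:
                    hits.add(hit)
    # Ambiguous or missing neighbours are ignored.
    return hits.pop() if len(hits) == 1 else None

def count_alleles(reads, index, k):
    """Count, per (position, allele), the reads supporting it."""
    counts = Counter()
    for read in reads:
        seen = {lookup(km, index) for km in kmers(read, k)}
        seen.discard(None)
        counts.update(seen)
    return counts

def call_genotype(counts, pos, min_depth=2):
    """Naive threshold-based call (an assumption, not LAVA's model)."""
    ref, alt = counts[(pos, "ref")], counts[(pos, "alt")]
    if ref + alt < min_depth:
        return "no-call"
    frac = alt / (ref + alt)
    if frac < 0.2:
        return "0/0"
    if frac > 0.8:
        return "1/1"
    return "0/1"

# Toy data: one SNP (C>T) at position 15 of a made-up reference.
reference = "ACGTTGCAAGGCTTACCGATGACTGGATCCTA"
index = build_index(reference, {15: "T"}, k=8)
reads = [
    "GGCTTACCGATG",  # reference allele
    "CTTACCGATGAC",  # reference allele
    "TTACCGTTG",     # reference allele with one sequencing error
    "GCTTATCGATGA",  # alternate allele
    "AGGCTTATCGA",   # alternate allele
]
counts = count_alleles(reads, index, k=8)
print(call_genotype(counts, 15))  # 0/1
```

Exact matches are tried before neighbours because a reference-allele k-mer and its alternate-allele counterpart differ by exactly one base. Searching neighbours first would make every clean k-mer ambiguous between the two alleles.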

AVAILABILITY AND IMPLEMENTATION

LAVA software is available at http://lava.csail.mit.edu

CONTACT

bab@mit.edu

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
