Shajii Ariya, Yorukoglu Deniz, Yu Yun William, Berger Bonnie
Department of Electrical & Computer Engineering, Boston University, Boston, MA 02215, USA.
Computer Science and AI Lab.
Bioinformatics. 2016 Sep 1;32(17):i538-i544. doi: 10.1093/bioinformatics/btw460.
As the volume of next-generation sequencing (NGS) data increases, faster algorithms become necessary. Although speeding up individual components of a sequence analysis pipeline (e.g. read mapping) can reduce the computational cost of analysis, such approaches do not take full advantage of the particulars of a given problem. One problem of great interest is genotyping a known set of variants (e.g. dbSNP or Affymetrix SNPs). It is important for characterizing known genetic traits and causative disease variants within an individual, and it forms the initial stage of many ancestral and population genomic pipelines (e.g. GWAS).
We introduce lightweight assignment of variant alleles (LAVA), an NGS-based genotyping algorithm for a given set of SNP loci. LAVA exploits the fact that approximate matching of mid-size k-mers (with k = 32) can typically identify loci in the human genome uniquely, without full read alignment. LAVA accurately calls the vast majority of SNPs in dbSNP and Affymetrix's Genome-Wide Human SNP Array 6.0 up to about an order of magnitude faster than standard NGS genotyping pipelines. For Affymetrix SNPs, LAVA has significantly higher SNP calling accuracy than existing pipelines while using as little as ∼5 GB of RAM. As such, LAVA represents a scalable computational method for population-level genotyping studies as well as a flexible NGS-based replacement for SNP arrays.
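The core idea, genotyping known SNPs by k-mer lookup instead of read alignment, can be illustrated with a minimal sketch. This is not LAVA's implementation: every function name below is hypothetical, "approximate matching" is assumed here to mean a fallback to k-mers at Hamming distance 1, and the allele-fraction threshold in `call_genotype` is a placeholder rather than the paper's calling model. The sketch also ignores k-mers elsewhere in the genome that might collide with SNP k-mers, which a real tool would have to handle.

```python
from collections import defaultdict

def snp_kmers(ref, pos, allele, k):
    """Yield every k-mer covering `pos` after placing `allele` at that position."""
    seq = ref[:pos] + allele + ref[pos + 1:]
    for start in range(max(0, pos - k + 1), min(pos, len(seq) - k) + 1):
        yield seq[start:start + k]

def build_index(ref, snps, k):
    """Map each SNP-covering k-mer to the set of (snp_id, allele) pairs it supports."""
    index = defaultdict(set)
    for snp_id, (pos, ref_allele, alt_allele) in snps.items():
        for allele in (ref_allele, alt_allele):
            for kmer in snp_kmers(ref, pos, allele, k):
                index[kmer].add((snp_id, allele))
    return index

def hamming1(kmer):
    """Yield all k-mers that differ from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for sub in "ACGT":
            if sub != base:
                yield kmer[:i] + sub + kmer[i + 1:]

def count_alleles(reads, index, k):
    """Count allele support per SNP from read k-mers, with no alignment step."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            hits = set(index.get(kmer, ()))
            if not hits:
                # Assumed form of approximate matching: try one-mismatch
                # neighbors only when the exact k-mer is not in the index.
                for cand in hamming1(kmer):
                    hits |= index.get(cand, set())
            if len(hits) == 1:  # discard ambiguous evidence
                snp_id, allele = next(iter(hits))
                counts[snp_id][allele] += 1
    return counts

def call_genotype(ref_count, alt_count, min_frac=0.2):
    """Placeholder caller based on the alternate-allele fraction."""
    total = ref_count + alt_count
    if total == 0:
        return None  # no evidence at this locus
    frac = alt_count / total
    if frac < min_frac:
        return "0/0"
    if frac > 1 - min_frac:
        return "1/1"
    return "0/1"

def genotype(reads, ref, snps, k=32):
    """Genotype each SNP in `snps` ({id: (pos, ref_allele, alt_allele)})."""
    index = build_index(ref, snps, k)
    counts = count_alleles(reads, index, k)
    return {
        snp_id: call_genotype(counts[snp_id][ref_allele], counts[snp_id][alt_allele])
        for snp_id, (pos, ref_allele, alt_allele) in snps.items()
    }
```

On a toy 40-base reference with a single SNP and a small k such as 8, one read carrying the reference allele and one carrying the alternate allele yield a heterozygous call. Trying exact matches before one-mismatch neighbors matters in this sketch: the reference and alternate k-mers at a locus are themselves one mismatch apart, so searching neighbors unconditionally would make every observation ambiguous.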
LAVA software is available at http://lava.csail.mit.edu
Supplementary data are available at Bioinformatics online.