Lindberg Michael R, Hall Ira M, Quinlan Aaron R
Department of Biochemistry and Molecular Genetics, Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA; Department of Medicine, The Genome Institute, Washington University School of Medicine, St. Louis, MO, USA; and Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA.
Bioinformatics. 2015 Apr 15;31(8):1286-9. doi: 10.1093/bioinformatics/btu771. Epub 2014 Dec 2.
Current strategies for SNP and INDEL discovery incorporate sequence alignments from multiple individuals to maximize sensitivity and specificity. It is widely accepted that this approach also improves structural variant (SV) detection. However, multisample SV analysis has been stymied by the fundamental difficulties of SV calling, e.g. library insert size variability, SV alignment signal integration and detecting long-range genomic rearrangements involving disjoint loci. Extant tools suffer from poor scalability, which limits the number of genomes that can be co-analyzed and complicates analysis workflows. We have developed an approach that enables multisample SV analysis in hundreds to thousands of human genomes using commodity hardware. Here, we describe Hydra-Multi and measure its accuracy, speed and scalability using publicly available datasets provided by The 1000 Genomes Project and by The Cancer Genome Atlas (TCGA).
Hydra-Multi is written in C++ and is freely available at https://github.com/arq5x/Hydra.
Contact: aaronquinlan@gmail.com or ihall@genome.wustl.edu
Supplementary data are available at Bioinformatics online.