Medical Population Genetics Program, Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA.
Bioinformatics. 2012 Jul 15;28(14):1838-44. doi: 10.1093/bioinformatics/bts280. Epub 2012 May 7.
Eugene Myers, in his string graph paper, suggested that in a string graph, or equivalently a unitig graph, any path spells a valid assembly. Because a string/unitig graph also encodes every valid assembly of the reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of the reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be carried out with unitigs.
To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information in the input reads. SNPs and INDELs can then be called by mapping the unitigs against a reference genome. By applying the method to 35-fold human resequencing data, we showed that, in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It also has higher sensitivity than other de novo assembly-based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. On the methodological side, we propose the FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches, and one-pass construction of unitigs from an FMD-index.
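To make the notion of a unitig concrete, the sketch below collapses maximal non-branching paths in a toy directed overlap graph. It is an illustration only, under several assumptions: the graph is given as an explicit edge list between hypothetical read names (`r1`, `r2`, ...), strands are ignored, and the function name `build_unitigs` is invented here. It is not how fermi works; as stated above, fermi constructs unitigs in one pass from an FMD-index rather than from an explicit graph.

```python
from collections import defaultdict


def build_unitigs(edges):
    """Collapse maximal non-branching paths of a directed graph.

    `edges` is a list of (u, v) pairs meaning "read u overlaps into read v".
    Returns a list of unitigs, each a list of node names in path order.
    Hypothetical helper for illustration; not part of fermi.
    """
    succ = defaultdict(list)
    pred = defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def mergeable(u, v):
        # u -> v may be merged only if it is u's sole exit and v's sole entry.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    def extend(start, visited):
        # Walk forward from `start` while the next link is unambiguous.
        path = [start]
        visited.add(start)
        cur = start
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if nxt in visited or not mergeable(cur, nxt):
                break
            path.append(nxt)
            visited.add(nxt)
            cur = nxt
        return path

    unitigs = []
    visited = set()
    # Pass 1: start at every node that cannot be merged into its predecessor.
    for node in sorted(nodes):
        if len(pred[node]) == 1 and mergeable(pred[node][0], node):
            continue
        unitigs.append(extend(node, visited))
    # Pass 2: whatever is left lies on isolated cycles with no branching.
    for node in sorted(nodes):
        if node not in visited:
            unitigs.append(extend(node, visited))
    return unitigs


# Toy graph: a linear run r1 -> r2 -> r3 that forks into r4 and r5 -> r6.
toy_edges = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r3", "r5"), ("r5", "r6")]
toy_unitigs = build_unitigs(toy_edges)
print(toy_unitigs)
```

On this toy input the fork at `r3` ends the first unitig, so the reads are grouped into three unitigs: `r1, r2, r3`, then `r4`, then `r5, r6`. A real string graph is bidirected and carries overlap lengths, which this sketch omits.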