Lee Wan-Ping, Stromberg Michael P, Ward Alistair, Stewart Chip, Garrison Erik P, Marth Gabor T
Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America.
Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America; Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America.
PLoS One. 2014 Mar 5;9(3):e90581. doi: 10.1371/journal.pone.0090581. eCollection 2014.
MOSAIK is a stable, sensitive and open-source program for mapping second- and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific Biosciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the data generated in the 1000 Genomes Project (all sequencing technologies, low-coverage and exome). To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well suited to capturing mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs), and generates outputs tailored to aid SV discovery. All variant discovery benefits from an accurate description of read placement confidence. To this end, MOSAIK uses a neural-network-based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient greater than 0.98 between MOSAIK-assigned and actual mapping qualities. So that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me).
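The general idea of pairing hashed seeds with Smith-Waterman alignment can be illustrated with a minimal Python sketch. This is not MOSAIK's implementation: the function names (`build_kmer_index`, `best_seed_diagonal`, `smith_waterman`, `map_read`), the seed length `k`, the window padding and the scoring values are all assumptions chosen for illustration, and the clustering step is reduced to picking the diagonal with the most seed hits.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def best_seed_diagonal(read, index, k):
    """Cluster seed hits by diagonal (ref_pos - read_pos); return the densest one."""
    diagonals = defaultdict(int)
    for read_pos in range(len(read) - k + 1):
        for ref_pos in index.get(read[read_pos:read_pos + k], ()):
            diagonals[ref_pos - read_pos] += 1
    if not diagonals:
        return None  # no seed matched, so the read is left unmapped
    return max(diagonals, key=diagonals.get)


def smith_waterman(read, window, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty (illustrative scores).

    Returns (best_score, end_in_read, end_in_window), where the end
    coordinates are exclusive offsets into each sequence.
    """
    rows, cols = len(read) + 1, len(window) + 1
    score = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if read[i - 1] == window[j - 1] else mismatch
            cell = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + step,  # match or mismatch
                score[i - 1][j] + gap,       # gap in the reference window
                score[i][j - 1] + gap,       # gap in the read
            )
            score[i][j] = cell
            if cell > best[0]:
                best = (cell, i, j)
    return best


def map_read(read, reference, index, k, pad=5):
    """Seed with hashed k-mers, then align only within the candidate window."""
    diagonal = best_seed_diagonal(read, index, k)
    if diagonal is None:
        return None
    start = max(0, diagonal - pad)
    end = min(len(reference), diagonal + len(read) + pad)
    score, _, end_in_window = smith_waterman(read, reference[start:end])
    return {"window_start": start, "score": score, "ref_end": start + end_in_window}


# Toy usage: the read is an exact substring of the toy reference.
reference = "TTGACCAGTAGGCATCGATTACGGCTAAGCTTGCA"
index = build_kmer_index(reference, k=4)
hit = map_read("GGCATCGATTAC", reference, index, k=4)
print(hit)
```

Restricting the dynamic programming to a padded window around the seed cluster keeps the quadratic alignment step small, while the padding leaves room for short insertions and deletions near the read ends.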