Kimura Kouichi, Koike Asako
Biosystems Research Department, Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, Japan.
Bioinformatics. 2015 May 15;31(10):1577-83. doi: 10.1093/bioinformatics/btv024. Epub 2015 Jan 20.
Sequence-variation analysis is conventionally performed on mapping results that are highly redundant and occasionally contain undesirable heuristic biases. A straightforward approach to single-nucleotide polymorphism (SNP) analysis, using the Burrows-Wheeler transform (BWT) of short-read data, is proposed.
The BWT makes it possible to simultaneously process collections of read fragments that share the same sequence; accordingly, SNPs were found from the BWT much faster than from the mapping results. Finding SNPs from the BWT (together with supplementary data, the fragment depth of coverage [FDC]) took only a few minutes on a desktop workstation for human exome or transcriptome sequencing data, and 20 min on a dual-CPU server for human genome sequencing data. The SNPs found with the proposed method largely agreed with those found by a time-consuming state-of-the-art tool, except in cases where the use of read fragments led to a loss of sensitivity or where the sequencing depth was insufficient. These exceptions were predictable in advance on the basis of the minimum length for uniqueness (MLU) and the FDC defined on the reference genome. Moreover, provided that the data were large enough, the BWT and FDC were computed in less time than it took to obtain the mapping results.
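To make concrete how a BWT lets all read fragments sharing a sequence be handled at once, the following is a minimal sketch and not the authors' implementation. It builds the BWT of a small read collection by naively sorting rotations (practical only for toy inputs) and uses standard backward search to count how many times a fragment occurs across all reads. The function names `bwt` and `count_occurrences`, the `#` read separator, the `$` terminator, and the example reads are all assumptions made for illustration; the resulting count is only loosely in the spirit of a fragment depth and is not the paper's FDC definition.

```python
from collections import Counter


def bwt(text):
    """Naive BWT by sorting all rotations; `text` must end with a unique '$'."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(text[i - 1] for i in order)


def count_occurrences(bwt_str, pattern):
    """Backward search: count occurrences of `pattern` in the indexed text."""
    counts = Counter(bwt_str)
    # first[c] = number of characters in the text that sort before c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt_str)  # half-open interval of matching rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        # Rank queries done by plain counting; real indexes precompute these.
        lo = first[c] + bwt_str[:lo].count(c)
        hi = first[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


# Hypothetical toy reads, joined with '#' and terminated with '$'.
reads = ["ACGTAC", "CGTACG", "GTACGA"]
text = "#".join(reads) + "$"
index = bwt(text)
print(count_occurrences(index, "GTAC"))
```

All occurrences of a fragment form one contiguous interval of the sorted rotations, so a single search covers every read that contains it. A production tool would use a linear-time construction and a sampled rank structure instead of the quadratic steps shown here.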