Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
Bioinformatics. 2010 Jun 15;26(12):i343-9. doi: 10.1093/bioinformatics/btq184.
High-throughput sequencing (HTS) technologies are transforming the study of genomic variation. The various HTS technologies have different sequencing biases and error rates. Most of them sequence the residues of the genome directly, generating a base call for each position, whereas the Applied Biosystems SOLiD platform generates dibase-coded (color-space) sequences. Although combining data from the various platforms should increase the accuracy of variation detection, to date only a few tools can identify variants from color-space data, and none can analyze color-space and regular (letter-space) data together.
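To make the dibase coding concrete, the following minimal sketch converts a letter-space sequence into colors. It assumes the commonly described scheme in which each color is the XOR of 2-bit codes for two adjacent bases (A=0, C=1, G=2, T=3). The function names and the example sequences are hypothetical, and the leading primer base that real SOLiD reads carry is omitted for simplicity.

```python
# 2-bit base codes; XOR of adjacent codes yields the dibase color (0-3).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_color_space(seq):
    """Encode a letter-space sequence as a list of colors, one per adjacent base pair."""
    codes = [BASE_CODE[base] for base in seq.upper()]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def changed_positions(colors_a, colors_b):
    """Return the indices at which two equal-length color sequences differ."""
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]


# Hypothetical example: a reference and a read carrying one SNP (G -> A at index 2).
reference = "ACGTA"
snp_read = "ACATA"

ref_colors = to_color_space(reference)
snp_colors = to_color_space(snp_read)
diff = changed_positions(ref_colors, snp_colors)

print(ref_colors, snp_colors, diff)
```

Because every base contributes to two adjacent colors, an isolated SNP changes two consecutive colors, while a single miscalled color changes only one. This property is what lets color-space methods separate true variants from sequencing errors.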
We present VARiD, a probabilistic method for variation detection from both letter- and color-space reads simultaneously. VARiD is based on a hidden Markov model and uses the forward-backward algorithm to accurately identify heterozygous, homozygous and tri-allelic SNPs, as well as micro-indels. Our analysis shows that VARiD performs better than the AB SOLiD toolset at detecting variants from color-space data alone, and improves the calls dramatically when letter- and color-space reads are combined.
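As a generic illustration of the forward-backward algorithm mentioned above, the sketch below computes per-position posterior state probabilities for a toy two-state hidden Markov model. It is not VARiD's actual model: the states ("reference" and "variant"), the observation alphabet (0 = read agrees with the reference, 1 = read disagrees) and all probabilities are assumptions chosen only for illustration.

```python
def forward_backward(obs, init, trans, emit):
    """Return posterior state probabilities for each position of obs.

    init[i]     : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability that state i emits observation o
    """
    n_states = len(init)
    length = len(obs)
    fwd = [[0.0] * n_states for _ in range(length)]
    bwd = [[1.0] * n_states for _ in range(length)]

    # Forward pass, rescaled at every step to avoid numerical underflow.
    for t in range(length):
        for j in range(n_states):
            if t == 0:
                prior = init[j]
            else:
                prior = sum(fwd[t - 1][i] * trans[i][j] for i in range(n_states))
            fwd[t][j] = prior * emit[j][obs[t]]
        scale = sum(fwd[t])
        fwd[t] = [v / scale for v in fwd[t]]

    # Backward pass, also rescaled; the scale cancels in the normalization below.
    for t in range(length - 2, -1, -1):
        for i in range(n_states):
            bwd[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * bwd[t + 1][j]
                for j in range(n_states)
            )
        scale = sum(bwd[t])
        bwd[t] = [v / scale for v in bwd[t]]

    # Posterior is proportional to forward * backward at each position.
    posteriors = []
    for t in range(length):
        raw = [fwd[t][s] * bwd[t][s] for s in range(n_states)]
        total = sum(raw)
        posteriors.append([v / total for v in raw])
    return posteriors


# Toy parameters (assumed): state 0 = reference, state 1 = variant.
init = [0.99, 0.01]
trans = [[0.95, 0.05], [0.30, 0.70]]
emit = [[0.9, 0.1], [0.2, 0.8]]
observations = [0, 0, 1, 1, 0, 0]

posteriors = forward_backward(observations, init, trans, emit)
for position, probs in enumerate(posteriors):
    print(position, [round(p, 3) for p in probs])
```

The posterior for the variant state rises at the positions where disagreements cluster. A variant caller built on such a model would threshold these posteriors rather than rely on a single most likely path.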
The toolset is freely available at http://compbio.cs.utoronto.ca/varid.