Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK.
BMC Bioinformatics. 2012 Mar 23;13:47. doi: 10.1186/1471-2105-13-47.
Next-generation sequencing provides detailed insight into the variation present within viral populations, opening the possibility of treatment strategies that are both reactive and predictive. Current software tools, however, need to be scaled up to accommodate high-depth viral data sets, which are often temporally or spatially linked. In addition, novel sequencing platforms and chemistries continue to be developed, each with its own strengths and weaknesses, so researchers need to be able to routinely compare and combine data sets from different platforms and chemistries. In particular, the error associated with a specific sequencing process must be quantified so that true biological variation can be identified.
Segminator II was developed to allow efficient comparison of data sets derived from different sources. We demonstrate its use by comparing large data sets from 12 influenza H1N1 samples sequenced on both the 454 Life Sciences and Illumina platforms, which permits quantification of platform error. For mismatches, median error rates of 0.10% and 0.12%, respectively, suggested that both platforms performed similarly. For insertions and deletions, median error rates within the 454 data (0.3% and 0.2%, respectively) were significantly higher than those within the Illumina data (0.004% and 0.006%, respectively). In agreement with previous observations, these higher rates were strongly associated with homopolymeric stretches on the 454 platform. Outside such regions, both platforms had similar indel error profiles. Additionally, we apply our software to the identification of low-frequency variants.
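As an illustration of how a median error rate of this kind can be summarised, the following minimal Python sketch computes a per-sample error rate as a percentage of aligned bases and then takes the median across samples. It is not taken from Segminator II: the function names and the toy counts are hypothetical, and it assumes that error counts and aligned base counts per sample are already available.

```python
from statistics import median
from typing import Iterable, Tuple


def error_rate_percent(error_count: int, aligned_bases: int) -> float:
    """Return the error rate of one sample as a percentage of aligned bases."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return 100.0 * error_count / aligned_bases


def median_error_rate(samples: Iterable[Tuple[int, int]]) -> float:
    """Median of per-sample error rates.

    Each element of `samples` is (error_count, aligned_bases) for one sample.
    """
    return median(error_rate_percent(e, n) for e, n in samples)


# Hypothetical counts for three samples; these are NOT values from the study.
toy_samples = [(10, 10_000), (12, 10_000), (8, 10_000)]
toy_median = median_error_rate(toy_samples)
print(f"median error rate: {toy_median:.2f}%")
```

Using the median rather than the mean keeps a single poorly performing sample from dominating the platform summary.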
Using Segminator II, we have demonstrated that platform-specific error can be distinguished from biological variation when data from two different platforms are available. We have used this approach to quantify the error present within the 454 and Illumina platforms in relation to genomic location as well as position on the read. Given that next-generation sequencing data are increasingly important in the analysis of drug resistance and vaccine trials, this software will be useful to the pathogen research community. A zip file containing the source code and jar file is freely available for download from http://www.bioinf.manchester.ac.uk/segminator/.
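The idea of separating platform-specific error from biological variation can be sketched as a simple concordance check: a variant observed above some frequency threshold on both platforms is a candidate for true variation, whereas one observed on only one platform is a candidate for platform error. The Python sketch below is an assumption-laden illustration, not the method implemented in Segminator II; the function name, the 1% threshold, and the toy frequencies are all hypothetical.

```python
from typing import Dict, Tuple

# A variant is keyed by (reference position, observed base or "-" for a deletion).
Variant = Tuple[int, str]
Frequencies = Dict[Variant, float]


def split_by_concordance(
    freqs_454: Frequencies,
    freqs_illumina: Frequencies,
    min_freq: float = 0.01,  # hypothetical threshold, not from the paper
) -> Tuple[Dict[Variant, Tuple[float, float]], Dict[Variant, Tuple[float, float]]]:
    """Split variants into those seen on both platforms and those seen on one.

    Returns (concordant, discordant); each maps a variant to its
    (454 frequency, Illumina frequency) pair. Variants below the
    threshold on both platforms are ignored.
    """
    concordant = {}
    discordant = {}
    for variant in sorted(set(freqs_454) | set(freqs_illumina)):
        pair = (freqs_454.get(variant, 0.0), freqs_illumina.get(variant, 0.0))
        if min(pair) >= min_freq:
            concordant[variant] = pair  # supported by both platforms
        elif max(pair) >= min_freq:
            discordant[variant] = pair  # supported by one platform only
    return concordant, discordant


# Hypothetical variant frequencies; these are NOT values from the study.
toy_454 = {(100, "A"): 0.05, (210, "-"): 0.03}
toy_illumina = {(100, "A"): 0.06, (350, "T"): 0.002}
both, one_only = split_by_concordance(toy_454, toy_illumina)
print("concordant:", both)
print("discordant:", one_only)
```

A fixed threshold is the simplest possible rule; a real analysis would also need to account for coverage depth and for the position-dependent error profiles described above.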