School of Biological Sciences, The University of Manchester, Manchester M13 9PT, UK.
Modernising Medical Microbiology Consortium, Nuffield Department of Clinical Medicine, John Radcliffe Hospital, University of Oxford, Oxford OX3 9DU, UK.
Viruses. 2019 Apr 26;11(5):394. doi: 10.3390/v11050394.
Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and fully exploit biological sequence data. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work, we explored the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. Leveraging highly compressed sequence transformations to accelerate sequence comparison, our approach yielded comparable accuracy to existing approaches, further demonstrating its suitability for sequences originating from diverse virus populations. We assessed the application of our methodology using both synthetic and real viral pathogen sequences. Our results show that the use of highly compressed sequence approximations can provide accurate results, with analytical performance retained and even enhanced through appropriate dimensionality reduction of sequence data.
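As a rough illustration of the idea of comparing sequences through compressed numerical representations, the sketch below encodes nucleotides as numbers and reduces them with piecewise aggregate approximation (PAA), a dimensionality reduction method from time-series analysis. The abstract does not state which transformations were used, so the choice of PAA, the `BASE_VALUES` mapping, and the function names `encode`, `paa` and `distance` are all assumptions made for this example, not the authors' method.

```python
from math import sqrt

# Hypothetical nucleotide-to-number mapping, chosen only for illustration.
BASE_VALUES = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def encode(seq):
    """Turn a DNA string into a list of numbers (unknown bases become 0.0)."""
    return [BASE_VALUES.get(base, 0.0) for base in seq.upper()]


def paa(signal, segments):
    """Piecewise aggregate approximation: the mean of each equal-width chunk."""
    n = len(signal)
    if segments <= 0 or n == 0 or n % segments != 0:
        raise ValueError("signal length must be a positive multiple of segments")
    width = n // segments
    return [sum(signal[i:i + width]) / width for i in range(0, n, width)]


def distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    read = "ACGTACGTACGTACGT"
    near = "ACGTACGTACGAACGT"   # one substitution relative to the read
    far = "TTTTGGGGCCCCAAAA"    # unrelated composition

    # Compress each 16-base sequence to 4 values before comparing.
    compressed = {name: paa(encode(s), 4)
                  for name, s in (("read", read), ("near", near), ("far", far))}
    print(distance(compressed["read"], compressed["near"]))
    print(distance(compressed["read"], compressed["far"]))
```

In this sketch the comparison runs on 4 values per sequence instead of 16, which is the kind of saving that motivates the approach; the compressed distance is only an approximation of a full-length comparison.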