Almeida Jonas S, Grüneberg Alexander, Maass Wolfgang, Vinga Susana
Div Informatics, Dept Pathology, University of Alabama at Birmingham, USA.
Algorithms Mol Biol. 2012 May 2;7(1):12. doi: 10.1186/1748-7188-7-12.
The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required.
In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique, that has been found to provide a "alignment-free" solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units: with no resort to dynamic programming.
The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing.
Public distribution of accompanying software library with open source and version control at http://usm.github.com. Also available as a webApp through Google Chrome's WebStore http://chrome.google.com/webstore: search with "usm".
基因组测序成本的急剧下降,以及分布式云计算资源日益增加的便利性,使得MapReduce编码模式成为可扩展生物信息学算法开发的基石。在某些情况下,算法将通过使用映射函数来处理向量化组件找到自然分布,随后对聚合中间结果进行归约。然而,对于一些数据分析过程,如序列分析,可能需要更根本的重新表述。
在本报告中,我们描述了一种序列比较的解决方案,该方案可以彻底分解为多轮映射和归约操作。所采用的方法利用了迭代映射,这是一种分形分析技术,已被发现可为序列分析和比较提供“无比对”解决方案。也就是说,一种不需要动态规划的解决方案,依赖于数字混沌游戏表示(CGR)数据结构。本报告通过仅检查两个类似单元的USM坐标来计算最长相似片段的长度来证明这一说法:无需借助动态规划。
所描述的过程是对序列比对进行极端分解和并行化的尝试,以应对当前算法框架无法处理的大量基因组序列数据。找到的解决方案通过基于浏览器的应用程序(webApp)提供,突出了浏览器作为高性能分布式计算环境的出现。