Gärtner Fabian, Höner Zu Siederdissen Christian, Müller Lydia, Stadler Peter F
1 Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Universität Leipzig, Augustusplatz 12, 04107 Leipzig, Germany.
2 Bioinformatics Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany.
Algorithms Mol Biol. 2018 Sep 24;13:15. doi: 10.1186/s13015-018-0133-4. eCollection 2018.
Genome sequences and genome annotation data have become available at ever-increasing rates in response to the rapid progress in sequencing technologies. As a consequence, the demand for methods supporting comparative, evolutionary analysis is also growing. In particular, efficient tools to visualize -omics data simultaneously for multiple species are sorely lacking. A first and crucial step in this direction is the construction of a common coordinate system. Since genomes differ not only by rearrangements but also by large insertions, deletions, and duplications, the use of a single reference genome is insufficient, in particular when the number of species becomes large.
The computational problem then becomes to determine an order and orientations of optimal local alignments that are as co-linear as possible with all the genome sequences. We first review the most prominent approaches to modeling the problem formally and then show that it can be phrased as a particular variant of the Betweenness Problem. It is NP-hard in general. As exact solutions are beyond reach for the problem sizes of practical interest, we introduce a collection of heuristic simplifiers to resolve ordering conflicts.
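To illustrate the kind of ordering constraint involved, the following minimal sketch uses the classical Betweenness Problem rather than the particular variant treated in the paper: given triples (a, b, c), find a linear order in which b lies between a and c for as many triples as possible. The function names `satisfied` and `best_order` and the block labels are hypothetical and chosen only for illustration. Orientations are ignored here, and the exhaustive search is feasible only for a handful of blocks, which is the reason heuristics are needed at genome scale.

```python
from itertools import permutations

def satisfied(order, triples):
    """Count the triples (a, b, c) for which b lies between a and c in `order`."""
    pos = {block: i for i, block in enumerate(order)}
    count = 0
    for a, b, c in triples:
        if pos[a] < pos[b] < pos[c] or pos[c] < pos[b] < pos[a]:
            count += 1
    return count

def best_order(blocks, triples):
    """Exhaustively search all orders; running time grows factorially with len(blocks)."""
    return max(permutations(blocks), key=lambda order: satisfied(order, triples))

# Hypothetical alignment blocks and betweenness constraints derived from two genomes.
blocks = ["A", "B", "C", "D"]
consistent = [("A", "B", "C"), ("B", "C", "D"), ("A", "C", "D")]
conflicting = [("A", "B", "C"), ("B", "A", "C")]  # cannot both hold in one order

order = best_order(blocks, consistent)
print(order, satisfied(order, consistent))
print(satisfied(best_order(blocks, conflicting), conflicting))
```

In the second constraint set the two triples contradict each other, so no order satisfies both. This is a toy instance of the ordering conflicts that the heuristic simplifiers are meant to resolve.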
Benchmarks on real-life data ranging from bacterial to fly genomes demonstrate the feasibility of computing good common coordinate systems.