andi: fast and accurate estimation of evolutionary distances between closely related genomes.

Author information

Haubold Bernhard, Klötzl Fabian, Pfaffelhuber Peter

Affiliations

Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Germany; Institute for Neuro- and Bioinformatics, Lübeck University, 23562 Lübeck, Germany; and Mathematical Stochastics, Mathematical Institute, Freiburg University, Germany.

Publication information

Bioinformatics. 2015 Apr 15;31(8):1169-75. doi: 10.1093/bioinformatics/btu815. Epub 2014 Dec 10.

DOI:10.1093/bioinformatics/btu815

PMID:25504847

Abstract

MOTIVATION

A standard approach to classifying sets of genomes is to calculate their pairwise distances. This is difficult for large samples. We have therefore developed an algorithm for rapidly computing the evolutionary distances between closely related genomes.

RESULTS

Our distance measure is based on ungapped local alignments that we anchor through pairs of maximal unique matches of a minimum length. These exact matches can be looked up efficiently using enhanced suffix arrays, and our implementation requires only approximately 1 s and 45 MB of RAM per Mbase analysed. The pairing of matches distinguishes non-homologous from homologous regions, leading to accurate distance estimation. We show this by analysing simulated data and genome samples ranging from 29 Escherichia coli/Shigella genomes to 3085 genomes of Streptococcus pneumoniae.
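
The idea of anchoring ungapped alignments through pairs of exact matches can be illustrated with a small sketch. This is not the andi implementation. It rests on several assumptions that are not in the abstract: a dictionary of k-mers that occur once in the subject stands in for the enhanced suffix array; matches are extended to the right only; two consecutive matches count as a pair when they lie on the same diagonal; unpaired matches are ignored; and the mismatch rate is converted with a Jukes-Cantor correction. The names `find_anchors`, `anchor_distance` and the parameter `k` are hypothetical.

```python
import math


def find_anchors(query, subject, k=8):
    """Return exact matches (query_pos, subject_pos, length), each seeded by
    a k-mer that occurs exactly once in the subject (simplifying assumption)."""
    positions = {}
    for j in range(len(subject) - k + 1):
        positions.setdefault(subject[j:j + k], []).append(j)
    unique = {kmer: pos[0] for kmer, pos in positions.items() if len(pos) == 1}

    anchors = []
    i = 0
    while i + k <= len(query):
        j = unique.get(query[i:i + k])
        if j is None:
            i += 1
            continue
        # Extend the seed to the right until a mismatch or a sequence end.
        length = k
        while (i + length < len(query) and j + length < len(subject)
               and query[i + length] == subject[j + length]):
            length += 1
        anchors.append((i, j, length))
        i += length + 1  # jump past the position that ended the match
    return anchors


def anchor_distance(query, subject, k=8):
    """Estimate substitutions per site from regions flanked by anchor pairs.
    Returns NaN when no pair of anchors is found."""
    homologous = 0
    mismatches = 0
    prev = None
    prev_counted = False
    for cur in find_anchors(query, subject, k):
        if prev is not None:
            q_end = prev[0] + prev[2]
            s_end = prev[1] + prev[2]
            gap = cur[0] - q_end
            # Same spacing in both sequences: treat the gap as an ungapped,
            # homologous alignment and count its mismatches.
            if gap == cur[1] - s_end:
                if not prev_counted:
                    homologous += prev[2]
                homologous += gap + cur[2]
                mismatches += sum(
                    a != b
                    for a, b in zip(query[q_end:cur[0]], subject[s_end:cur[1]])
                )
                prev_counted = True
            else:
                prev_counted = False
        prev = cur
    if homologous == 0:
        return float("nan")
    p = mismatches / homologous
    if p >= 0.75:
        return float("nan")  # Jukes-Cantor correction is undefined here
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Calling `anchor_distance(query, subject)` on two closely related sequences returns an estimate of substitutions per site. Matches that hit an unrelated region by chance are unlikely to be equally spaced in both sequences, so they rarely form a pair and do not contribute to the estimate.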

AVAILABILITY AND IMPLEMENTATION

We have implemented the computation of anchor distances in the multithreaded UNIX command-line program andi for ANchor DIstances. C sources and documentation are posted at http://github.com/evolbioinf/andi/

CONTACT

haubold@evolbio.mpg.de

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
