De Maio Nicola, Boulton William, Weilguny Lukas, Walker Conor R, Turakhia Yatish, Corbett-Detig Russell, Goldman Nick
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.
Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, UK.
bioRxiv. 2021 Sep 23:2021.03.15.435416. doi: 10.1101/2021.03.15.435416.
Sequence simulators are fundamental tools in bioinformatics: they allow us to test data processing and inference tools, and they are themselves a component of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes now available, which exceeds the processing capacity of many methods; simulating datasets of this size is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach and implements a multi-layered search tree structure that achieves high computational efficiency by exploiting the fact that only a small proportion of the genome is likely to mutate on each branch of the phylogeny considered. Our open source software is available from https://github.com/NicolaDM/phastSim. It integrates easily with other Python packages and supports a variety of evolutionary models, including indel models and new hypermutability models that we developed to represent SARS-CoV-2 genome evolution more realistically.
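The idea of combining the Gillespie approach with a search tree over per-site rates can be illustrated with a minimal sketch. This is not phastSim's implementation or API: the `RateTree` class, the `evolve_branch` function and the rate values in `Q` are all hypothetical, and a plain binary sum tree stands in for the multi-layered structure described above. The point it shows is that, on a short branch, each mutation costs only a logarithmic-time lookup and update rather than a pass over the whole genome.

```python
import random

# Illustrative substitution rates (made-up values, NOT taken from phastSim).
Q = {
    "A": {"C": 0.1, "G": 0.3, "T": 0.1},
    "C": {"A": 0.1, "G": 0.1, "T": 0.9},
    "G": {"A": 0.3, "C": 0.1, "T": 0.2},
    "T": {"A": 0.1, "C": 0.3, "G": 0.1},
}


def exit_rate(nuc):
    """Total rate at which a site currently in state `nuc` mutates."""
    return sum(Q[nuc].values())


class RateTree:
    """Binary sum tree over per-site mutation rates (hypothetical helper)."""

    def __init__(self, rates):
        self.size = 1
        while self.size < len(rates):
            self.size *= 2
        self.node = [0.0] * (2 * self.size)
        for i, r in enumerate(rates):
            self.node[self.size + i] = r
        for i in range(self.size - 1, 0, -1):
            self.node[i] = self.node[2 * i] + self.node[2 * i + 1]

    def total(self):
        return self.node[1]

    def update(self, site, rate):
        """Set the rate of one site and refresh the sums on its path to the root."""
        i = self.size + site
        self.node[i] = rate
        i //= 2
        while i >= 1:
            self.node[i] = self.node[2 * i] + self.node[2 * i + 1]
            i //= 2

    def sample(self, u):
        """Return the site whose cumulative-rate interval contains u (0 <= u < total)."""
        i = 1
        while i < self.size:
            left = self.node[2 * i]
            # Never descend into an empty right subtree (guards against rounding).
            if u < left or self.node[2 * i + 1] == 0.0:
                i = 2 * i
            else:
                u -= left
                i = 2 * i + 1
        return i - self.size


def evolve_branch(seq, tree, length, rng):
    """Gillespie simulation along one branch; mutates `seq` and `tree` in place."""
    mutations = []
    t = 0.0
    while tree.total() > 0.0:
        t += rng.expovariate(tree.total())  # waiting time to the next mutation
        if t > length:
            break
        site = tree.sample(rng.random() * tree.total())
        old = seq[site]
        targets = Q[old]
        new = rng.choices(list(targets), weights=list(targets.values()))[0]
        seq[site] = new
        tree.update(site, exit_rate(new))  # only this site's rate changes
        mutations.append((site, old, new))
    return mutations


rng = random.Random(1)
genome = [rng.choice("ACGT") for _ in range(1000)]
tree = RateTree([exit_rate(n) for n in genome])
print(evolve_branch(genome, tree, 0.001, rng))
```

In this sketch the sequence and tree are modified in place, so simulating a full phylogeny would additionally require undoing or copying these changes when moving between sibling branches; that bookkeeping is omitted here.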