phastSim: efficient simulation of sequence evolution for pandemic-scale datasets.

Authors

De Maio Nicola, Boulton William, Weilguny Lukas, Walker Conor R, Turakhia Yatish, Corbett-Detig Russell, Goldman Nick

Affiliations

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.

Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, UK.

Publication information

bioRxiv. 2021 Sep 23:2021.03.15.435416. doi: 10.1101/2021.03.15.435416.

DOI:10.1101/2021.03.15.435416
PMID:33758852
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7987011/
Abstract

Sequence simulators are fundamental tools in bioinformatics: they allow us to test data processing and inference tools, and they form part of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which is beyond the processing power of many methods; simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate on each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.
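The idea in the abstract, a Gillespie simulation whose site selection goes through a search tree, can be illustrated with a minimal sketch. This is not phastSim's implementation or API: the class `RateTree`, the function `evolve_branch`, and the per-nucleotide rates in `SITE_RATE` are hypothetical names and values chosen for illustration, and a plain binary sum tree stands in for the multi-layered structure the authors describe. The sketch handles a single branch only. It shows why short branches are cheap: each mutation costs O(log L) for a genome of length L, so the work per branch scales with the number of mutations rather than with genome length.

```python
import random


class RateTree:
    """Binary sum tree over per-site rates (illustrative, not phastSim's structure).

    Leaf i holds the mutation rate of site i; every internal node holds the
    sum of its two children, so the root holds the total rate.
    """

    def __init__(self, rates):
        self.n = len(rates)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.node = [0.0] * (2 * self.size)
        for i, r in enumerate(rates):
            self.node[self.size + i] = r
        for i in range(self.size - 1, 0, -1):
            self.node[i] = self.node[2 * i] + self.node[2 * i + 1]

    def total(self):
        return self.node[1]

    def update(self, site, rate):
        """Set the rate of one site and refresh the sums on the path to the root."""
        i = self.size + site
        self.node[i] = rate
        i //= 2
        while i >= 1:
            self.node[i] = self.node[2 * i] + self.node[2 * i + 1]
            i //= 2

    def sample(self, u):
        """Return the site whose cumulative-rate interval contains u (0 <= u < total)."""
        i = 1
        while i < self.size:
            left = self.node[2 * i]
            if u < left:
                i = 2 * i
            else:
                u -= left
                i = 2 * i + 1
        # Guard against floating-point drift into a zero-rate padding leaf.
        return min(i - self.size, self.n - 1)


def evolve_branch(seq, tree, site_rate, branch_length, rng):
    """Gillespie simulation along one branch.

    Mutates `seq` and `tree` in place and returns a list of
    (time, site, old_base, new_base) events.
    """
    t = 0.0
    events = []
    while tree.total() > 0.0:
        total = tree.total()
        t += rng.expovariate(total)          # waiting time to the next mutation
        if t > branch_length:
            break
        site = tree.sample(rng.random() * total)
        old = seq[site]
        new = rng.choice([b for b in "ACGT" if b != old])
        seq[site] = new
        tree.update(site, site_rate[new])    # the rate depends on the new base
        events.append((t, site, old, new))
    return events


# Hypothetical per-nucleotide mutation rates, for illustration only.
SITE_RATE = {"A": 1.0, "C": 2.0, "G": 1.5, "T": 0.5}

rng = random.Random(1)
seq = [rng.choice("ACGT") for _ in range(30000)]
tree = RateTree([SITE_RATE[b] for b in seq])
events = evolve_branch(seq, tree, SITE_RATE, branch_length=1e-4, rng=rng)
print(len(events), "mutations on this branch")
```

Extending the sketch to a whole phylogeny would require each child branch to start from its parent's state, for example by recording the events on a branch and undoing them after the subtree has been simulated; that bookkeeping is omitted here.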


