Department of Biochemistry, Genetics and Immunology.
Biomedical Research Center (CINBIO), University of Vigo, Vigo, Spain.
Bioinformatics. 2018 Jul 15;34(14):2506-2507. doi: 10.1093/bioinformatics/bty146.
Advances in sequencing technologies have made it feasible to obtain massive datasets for phylogenomic inference, often consisting of large numbers of loci from multiple species and individuals. The phylogenomic analysis of next-generation sequencing (NGS) data requires a complex computational pipeline in which multiple technical and methodological decisions, such as those related to coverage, assembly, mapping, variant calling and/or phasing, can influence the final tree obtained.
To assess the influence of these variables, we introduce NGSphy, an open-source tool for the simulation of Illumina reads/read counts obtained from haploid/diploid individual genomes with thousands of independent gene families evolving under a common species tree. To resemble real NGS experiments, NGSphy includes multiple options to model sequencing coverage (depth) heterogeneity across species, individuals and loci, including off-target or uncaptured loci. For comprehensive simulations covering multiple evolutionary scenarios, parameter values for the different replicates can be sampled from user-defined statistical distributions.
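The idea of coverage heterogeneity across species, individuals and loci can be illustrated with a minimal Python sketch. This is not NGSphy's implementation or API: the function name `sample_coverage`, its parameters, the choice of mean-1 gamma multipliers and the way off-target loci are zeroed out are all assumptions made for illustration only.

```python
import random


def sample_coverage(n_species, n_individuals, n_loci, base_depth=30.0,
                    species_shape=5.0, individual_shape=10.0,
                    locus_shape=2.0, off_target_fraction=0.1, seed=None):
    """Hypothetical sketch: return coverage[s][i][l], the expected depth for
    species s, individual i and locus l. Not NGSphy's actual model."""
    rng = random.Random(seed)

    def multiplier(shape):
        # Gamma variate with shape k and scale 1/k, so the mean is 1.
        # Smaller shape values give more heterogeneous multipliers.
        return rng.gammavariate(shape, 1.0 / shape)

    # Locus-level multipliers are shared by all individuals.
    # Off-target (uncaptured) loci are assumed to receive zero coverage.
    locus_mult = [
        0.0 if rng.random() < off_target_fraction else multiplier(locus_shape)
        for _ in range(n_loci)
    ]

    coverage = []
    for _ in range(n_species):
        species_mult = multiplier(species_shape)
        rows = []
        for _ in range(n_individuals):
            individual_mult = multiplier(individual_shape)
            rows.append([
                base_depth * species_mult * individual_mult * lm
                for lm in locus_mult
            ])
        coverage.append(rows)
    return coverage


if __name__ == "__main__":
    cov = sample_coverage(n_species=3, n_individuals=2, n_loci=5, seed=42)
    for s, rows in enumerate(cov):
        for i, depths in enumerate(rows):
            print(s, i, [round(d, 1) for d in depths])
```

In this sketch, each level (species, individual, locus) contributes an independent multiplier to a base depth, so replicates with different shape values would show more or less variation in depth. The shape values and `base_depth` stand in for the kind of user-defined distribution parameters the text describes.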
Source code, full documentation and tutorials including a 'Getting started' guide are available at http://github.com/merlyescalona/ngsphy.
Supplementary data are available at Bioinformatics online.