Bennett Dominic J, Hettling Hannes, Silvestro Daniele, Zizka Alexander, Bacon Christine D, Faurby Søren, Vos Rutger A, Antonelli Alexandre
Gothenburg Global Biodiversity Centre, Box 461, SE-405 30 Gothenburg, Sweden.
Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE-405 30 Gothenburg, Sweden.
Life (Basel). 2018 Jun 5;8(2):20. doi: 10.3390/life8020020.
The exceptional increase in molecular DNA sequence data in open repositories is mirrored by an ever-growing interest among evolutionary biologists in harvesting and using those data for phylogenetic inference. However, these data have many known quality issues, and their sheer volume and complexity can pose considerable barriers to their usefulness. A key issue is the high frequency of sequence mislabeling encountered when searching for suitable sequences for phylogenetic analysis. Mislabeling includes, among other problems, the incorrect identification of sequenced species, non-standardized and ambiguous sequence annotation, and the inadvertent addition of paralogous sequences by users. Taken together, these problems likely add considerable noise, error or bias to phylogenetic inference, a risk that is likely to increase with the size of phylogenies or of the molecular datasets used to generate them. Here we present a software package, phylotaR, that bypasses these issues by instead using an alignment search tool to identify orthologous sequences. Our package builds on the framework of its predecessor, PhyLoTa, by providing a modular pipeline for identifying overlapping sequence clusters from up-to-date GenBank data, along with new features, improvements and tools. We demonstrate and test the pipeline's effectiveness by presenting trees generated from phylotaR clusters for two large clades: palms and primates. Given the versatility of this package, we hope that it will become a standard tool for any research aiming to use GenBank data for phylogenetic analysis.
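The abstract does not spell out how alignment hits are turned into sequence clusters, so the following is only a minimal sketch of one common approach, not the phylotaR implementation. It assumes pairwise hits have already been produced by an alignment search tool and groups sequences into connected components, ignoring sequence labels entirely. The function name `cluster_by_hits`, the `(query, subject, coverage)` hit format, and the `min_coverage` threshold are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def cluster_by_hits(seq_ids, hits, min_coverage=0.5):
    """Group sequences into clusters linked by sufficiently good hits.

    seq_ids: iterable of sequence identifiers (hypothetical input format).
    hits: iterable of (query_id, subject_id, coverage) tuples, where
          coverage is a fraction in [0, 1] (an assumed hit summary).
    min_coverage: hits below this value are ignored (assumed threshold).
    """
    # Union-find: every sequence starts as its own cluster.
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, coverage in hits:
        if coverage >= min_coverage:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two clusters

    clusters = defaultdict(list)
    for s in seq_ids:
        clusters[find(s)].append(s)

    # Largest clusters first; ties broken by member names for stable output.
    return sorted(
        (sorted(members) for members in clusters.values()),
        key=lambda c: (-len(c), c),
    )


# Toy data: identifiers and coverage values are made up for illustration.
example_ids = ["seqA", "seqB", "seqC", "seqD", "seqE"]
example_hits = [
    ("seqA", "seqB", 0.9),
    ("seqB", "seqC", 0.7),
    ("seqD", "seqE", 0.3),  # below the threshold, so not linked
]
clusters = cluster_by_hits(example_ids, example_hits)
print(clusters)
```

Because membership depends only on alignment hits, a sequence with a wrong or ambiguous annotation still lands with the sequences it actually resembles. A real pipeline would add further filters (for example on sequence length or taxonomic scope) that this sketch leaves out.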