Department of Zoology, University of Cambridge, Downing Street, Cambridge, CB2 3EJ, UK.
BMC Bioinformatics. 2021 Mar 9;22(1):115. doi: 10.1186/s12859-021-04048-0.
An unprecedented amount of genetic sequence data is now stored in publicly available repositories. For decades, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result these repositories hold a large volume of mtDNA data for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species' demographic past. However, compiling data in this manner is not trivial: there are many complexities associated with data extraction, data quality and data handling.
Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions.
There is now more genetic information available than ever before, and large meta-datasets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains technically challenging. The mtDNAcombine package provides a pipeline that streamlines downloading, curating, and analysing sequence data, and guides the compilation of datasets from the online database GenBank.