Department of Ecology and Evolution, Le Biophore UNIL-Sorge, University of Lausanne, Lausanne 1015, Switzerland.
Evolutionary-Functional Genomics Group, L'Amphipole UNIL-Sorge, Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland.
Gigascience. 2022 Feb 25;11. doi: 10.1093/gigascience/giac006.
Ambitious initiatives to coordinate genome sequencing of Earth's biodiversity mean that genomic data are accumulating rapidly. In addition to cataloguing biodiversity, these data provide the basis for understanding biological function and evolution. Accurate and complete genome assemblies offer a comprehensive and reliable foundation upon which to advance our understanding of organismal biology at genetic, species, and ecosystem levels. However, ever-changing sequencing technologies and analysis methods mean that available data are often heterogeneous in quality. To guide forthcoming genome generation efforts and promote efficient prioritization of resources, it is thus essential to define and monitor the taxonomic coverage and quality of the data.
Here we present an automated analysis workflow that surveys genome assemblies from the US National Center for Biotechnology Information (NCBI), assesses their completeness using the relevant Benchmarking Universal Single-Copy Orthologue (BUSCO) datasets, and collates the results into an interactively browsable resource. We apply our workflow to produce a community resource of available assemblies from the phylum Arthropoda, the Arthropoda Assembly Assessment Catalogue. Using this resource, we survey current taxonomic coverage and assembly quality at the NCBI, examine how key assembly metrics relate to gene content completeness, and compare results from using different BUSCO lineage datasets.
These results demonstrate how the workflow can be used to build a community resource that enables large-scale assessments to survey species coverage and data quality of available genome assemblies, and to guide prioritizations for ongoing and future sampling, sequencing, and genome generation initiatives.
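As a rough illustration of the collation step described above, the following Python sketch parses a BUSCO one-line score summary, computes a contig N50, and gathers both into one row per assembly. It is not the authors' workflow: the function names, the input dictionary layout, the accession labels, and all numbers are hypothetical placeholders, and it assumes the BUSCO summary uses the one-line notation of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`.

```python
import re

# Assumed BUSCO one-line notation, e.g. "C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:1013"
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return BUSCO percentages (C, S, D, F, M) as floats and n as an int."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError(f"no BUSCO score summary found in: {line!r}")
    scores = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    scores["n"] = int(match.group("n"))
    return scores


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths given")


def collate(assemblies):
    """Build one summary row per assembly, most complete first."""
    rows = []
    for assembly in assemblies:
        scores = parse_busco_summary(assembly["busco"])
        rows.append({
            "accession": assembly["accession"],
            "n50": n50(assembly["lengths"]),
            "complete": scores["C"],
            "duplicated": scores["D"],
            "missing": scores["M"],
        })
    return sorted(rows, key=lambda row: row["complete"], reverse=True)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    example = [
        {"accession": "ASSEMBLY_A", "lengths": [10, 5, 3, 2],
         "busco": "C:80.0%[S:78.0%,D:2.0%],F:5.0%,M:15.0%,n:1013"},
        {"accession": "ASSEMBLY_B", "lengths": [8, 8, 4, 3, 3, 2, 2, 2],
         "busco": "C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:1013"},
    ]
    for row in collate(example):
        print(row)
```

A table built this way could then be used to compare contiguity (N50) against BUSCO completeness across assemblies, which is the kind of relationship the survey examines.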