Department of Biology, Utah Valley University, 800 W. University Parkway, Orem, UT 84058, USA.
Genetics. 2024 Aug 7;227(4). doi: 10.1093/genetics/iyae099.
The number of genome assemblies has increased rapidly in recent years, with NCBI databases now holding over 41,000 eukaryotic genome assemblies across about 2,300 species. Increases in read length and improvements in assembly algorithms have led to increased contiguity and larger genome assemblies. While this number of assemblies is impressive, only about a third of them have a corresponding genome size estimate for their species in publicly available databases. In this paper, the total size of each genome assembly is compared with the publicly available genome size estimate for its species. The resulting deviations in size are assessed in relation to genome size, kingdom, sequencing platform, and standard assembly metrics, such as N50 and BUSCO values. A large proportion of assemblies deviate from their estimated genome size by more than 10%, and the deviations grow with genome size, suggesting that non-protein-coding and structural DNA may be to blame. Modest differences in the performance of sequencing platforms are noted as well. While assemblies that score well on standard assessment metrics are more likely to approach the estimated genome size, much of the variation in size deviation is not explained by these raw metrics. A new, proportional N50 metric is proposed, in which N50 values are expressed relative to the average chromosome size of each species. This new metric has a stronger relationship with complete genome assemblies and, because it is proportional, allows more direct comparison across assemblies of genomes that vary in size and architecture.
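The quantities named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the definitions here are assumptions: average chromosome size is taken as the estimated genome size divided by the haploid chromosome number, and deviation is the signed difference between assembly size and estimated size as a fraction of the estimate. The function and parameter names are hypothetical and are not taken from the paper.

```python
def n50(lengths):
    """Standard N50: the length of the contig at which the running total
    of contigs, sorted longest first, reaches half of the assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


def proportional_n50(lengths, estimated_genome_size, haploid_chromosome_count):
    """N50 relative to average chromosome size.

    Assumption: average chromosome size is the estimated genome size
    divided by the haploid chromosome number. The paper may define it
    differently.
    """
    average_chromosome_size = estimated_genome_size / haploid_chromosome_count
    return n50(lengths) / average_chromosome_size


def size_deviation(lengths, estimated_genome_size):
    """Signed deviation of assembly size from the estimated genome size,
    as a fraction of the estimate (assumed definition)."""
    return (sum(lengths) - estimated_genome_size) / estimated_genome_size
```

For a toy assembly with contigs of 40, 30, 20, and 10 units, an estimated genome size of 120 units, and 3 chromosomes, `n50` returns 30, `proportional_n50` returns 0.75, and `size_deviation` returns about -0.167, meaning the assembly falls roughly 17% short of the estimate and would count among those deviating by more than 10%.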