Jenke Nils, Smith Gregory M, Magar Buddha Thapa, Gruenstaeudl Michael
Freie Universität Berlin, Institut für Bioinformatik, Berlin, 14195, Germany.
Fort Hays State University, Department of Computer Science, Hays, 67601, Kansas, USA.
Res Sq. 2025 Jul 14:rs.3.rs-5784537. doi: 10.21203/rs.3.rs-5784537/v1.
Depth and evenness of sequencing coverage are considered potential indicators of genome assembly quality. In plastid genomics, where new data generation has outpaced the development of assembly quality indicators, these coverage metrics could offer insights into the quality of plastomes of different sizes, structures, or taxonomic origins. However, the variation of sequencing depth and evenness among archived plastid genomes, their variability between genome partitions, and any association with methodological factors have yet to be evaluated. This study explores the variation of sequencing depth and evenness across a sample of publicly accessible plastid genomes in relation to their genome structure, assembly quality, and methodological provenance using uni- and multivariate statistical analyses. We also evaluate whether sequencing evenness in plastid genomes is biased by phylogenetic signal and assembly software choice, and whether more uniformly distributed input sequence data improves plastome assembly quality. Our results indicate significant differences in sequencing depth across the four structural partitions and between the coding and non-coding regions of plastid genomes, a significant correlation between sequencing evenness and the number of ambiguous nucleotides, and a significant difference in sequencing evenness between sequencing platforms. However, we also find that different covariates representing additional, less explored factors often show similar, if not greater, explanatory power for the coverage variation. We detect no indications of phylogenetic or software choice bias on sequencing evenness and only weak indications of phylogenetic bias among the assembly quality metrics, suggesting that our study results represent genuine patterns. We also find that normalizing the distribution of the input sequence data before plastome assembly may improve assembly accuracy.
Taken together, these findings highlight that many public plastid genomes derive from sequence data with highly variable depth and evenness, and that this variation is influenced, at least partially, by genome structure as well as methodological factors.
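The abstract does not state how depth and evenness are computed, so the following is only an illustrative sketch under stated assumptions. It takes a list of per-base read depths (for example, for one genome partition) and reports the mean depth together with one possible evenness score: the share of total coverage that lies at or below the mean depth. The function names and this particular evenness definition are assumptions made for illustration and are not necessarily the metrics used in the study.

```python
from statistics import mean


def mean_depth(depths):
    """Average sequencing depth over all positions."""
    return mean(depths)


def evenness(depths):
    """Illustrative evenness score in [0, 1] (assumed definition).

    Each position contributes min(depth, mean depth); the sum is
    divided by the total coverage. A perfectly uniform profile
    scores 1.0, and strongly peaked profiles approach 0.
    """
    avg = mean(depths)
    if avg == 0:
        # No reads mapped at all: treat evenness as 0 by convention.
        return 0.0
    capped = sum(min(d, avg) for d in depths)
    return capped / (avg * len(depths))
```

Under this definition, a uniform profile such as `[10, 10, 10, 10]` scores 1.0, whereas `[0, 0, 0, 40]` has the same mean depth of 10 but an evenness of 0.25.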