State Key Laboratory of Earth Surface Processes and Resource Ecology and Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China.
Mol Ecol Resour. 2023 Feb;23(2):499-510. doi: 10.1111/1755-0998.13720. Epub 2022 Nov 1.
Polyploidy is ubiquitous and its consequences are complex and variable. A change of ploidy level generally influences genetic diversity and results in morphological, physiological and ecological differences between cells or organisms with different ploidy levels. To avoid cumbersome experiments and take advantage of the less biased information provided by the vast amounts of genome sequencing data, computational tools for ploidy estimation are urgently needed. Although a few such tools have been developed, several aspects of this estimation still need improvement, including the requirement for a reference genome, the lack of informative results and objective inferences, and the influence of false positives arising from sequencing errors and repeats. We have developed ploidyfrost, a de Bruijn graph-based method, to estimate ploidy levels from whole genome sequencing data sets without a reference genome. ploidyfrost provides a visual representation of the allele frequency distribution, generated using the ggplot2 package, as well as quantitative results obtained with a Gaussian mixture model. In addition, it takes advantage of the colouring information encoded in coloured de Bruijn graphs to analyse multiple samples simultaneously and to flexibly filter putative false positives. We evaluated the performance of ploidyfrost by analysing highly heterozygous or repetitive samples of Cyclocarya paliurus and a complex allooctoploid sample of Fragaria × ananassa. Moreover, we demonstrated that the accuracy of the results can be improved by applying a threshold to variant features, for example on Cramér's V coefficient, which may significantly reduce the side effects of sequencing errors and repeats on the constructed graph structure.
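The abstract mentions thresholding variant features on Cramér's V coefficient but does not say how the underlying table is built. As a minimal sketch of the statistic itself, the Python function below computes Cramér's V for a contingency table of counts, using the standard definition V = sqrt(chi2 / (n * (min(r, c) - 1))). The function name `cramers_v`, the example tables and the reading of rows as variant branches and columns as sample colours are assumptions made for illustration only; they are not taken from ploidyfrost.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c contingency table of non-negative counts.

    Hypothetical helper for illustration; not part of ploidyfrost.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k < 1:
        raise ValueError("need at least a 2 x 2 table with a non-zero total")

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected

    return math.sqrt(chi2 / (n * k))


# Assumed example: rows are two branches of a variant, columns are two samples.
perfect_association = [[10, 0], [0, 10]]
no_association = [[5, 5], [5, 5]]
print(cramers_v(perfect_association))
print(cramers_v(no_association))
```

V ranges from 0 (no association between rows and columns) to 1 (complete association), so a filter of this kind would keep or discard a variant depending on which side of a chosen cut-off its V falls. The cut-off itself would have to come from the tool's documentation or from the paper.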