School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
Research Institute of Xi'an Jiaotong University, Zhejiang, Hangzhou 311200, China.
Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae647.
Estimating genome size from k-mer frequencies plays a fundamental role in designing genome sequencing and analysis projects, but it has remained challenging for polyploid species, i.e., those with ploidy p > 2. To address this, we introduce "findGSEP," which is based on iterative curve fitting of k-mer frequencies. Specifically, it first disentangles up to p normal distributions by analyzing k-mer frequencies in whole-genome sequencing data of the focal species. Second, it uses each fitted curve to compute the sizes of the genomic regions associated with 1∼p (homologous) chromosome(s), from which it infers the full polyploid and average haploid genome sizes. "findGSEP" can handle any level of ploidy p and infers genome size more accurately than other well-known tools, as shown by tests on simulated and real genomic sequencing data from various species, including octoploids.
"findGSEP" is implemented as a web server, freely available at http://146.56.237.198:3838/findGSEP/, and as an R package for parallel processing of multiple samples. Source code and a tutorial on installation and usage are available at https://github.com/sperfu/findGSEP.
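The two steps described in the abstract can be illustrated with a small, heavily simplified Python sketch. It is not the findGSEP implementation: instead of iterative curve fitting, it assigns each depth bin of a k-mer histogram to the nearest multiple of an assumed haploid peak depth, and it assumes that a k-mer in the i-th peak is shared by i homologous chromosomes, so the polyploid size is taken as the sum of i times the number of distinct k-mers in peak i. The function names (`gaussian_hist`, `estimate_sizes`), the parameters `c` and `sd`, and the synthetic data are all hypothetical and used only for illustration.

```python
import math


def gaussian_hist(areas, c, sd, max_depth):
    """Build a synthetic k-mer histogram {depth: number of distinct k-mers}.

    Peak i (1-based) is a normal curve centred at i * c whose area is
    areas[i - 1]; its standard deviation is assumed to grow as sd * sqrt(i).
    """
    hist = {}
    for depth in range(1, max_depth + 1):
        total = 0.0
        for i, area in enumerate(areas, start=1):
            mu = i * c
            sigma = sd * math.sqrt(i)
            z = (depth - mu) / sigma
            total += area * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
        hist[depth] = total
    return hist


def estimate_sizes(hist, c, ploidy):
    """Estimate region sizes and genome sizes by nearest-peak binning.

    Each depth bin is assigned to the nearest multiple of c in 1..ploidy
    (a stand-in for fitting one normal curve per peak). Returns the number
    of distinct k-mers per peak, the polyploid size, and the average
    haploid size.
    """
    distinct = [0.0] * ploidy
    for depth, count in hist.items():
        i = min(max(round(depth / c), 1), ploidy)
        distinct[i - 1] += count
    # Assumption: a k-mer in peak i is present on i homologous chromosomes.
    polyploid = sum(i * n for i, n in enumerate(distinct, start=1))
    return distinct, polyploid, polyploid / ploidy


if __name__ == "__main__":
    # Synthetic tetraploid example with an assumed haploid peak depth of 30.
    hist = gaussian_hist([4e6, 2e6, 1e6, 3e6], c=30, sd=2, max_depth=200)
    distinct, polyploid, haploid = estimate_sizes(hist, c=30, ploidy=4)
    print("distinct k-mers per peak:", [round(n) for n in distinct])
    print("polyploid size:", round(polyploid))
    print("average haploid size:", round(haploid))
```

On real data the peaks overlap and low-depth sequencing errors distort the histogram, which is where fitting the individual normal curves, as the abstract describes, becomes necessary; the binning above works only when the peaks are well separated.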