Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.
Peking-Tsinghua Center for Life Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, People's Republic of China.
BMC Genomics. 2020 Feb 26;21(1):183. doi: 10.1186/s12864-020-6597-x.
Whole-genome approaches are widely preferred for species delineation in prokaryotes. However, these methods require pairwise alignments and calculations at the whole-genome level and thus are computationally intensive. To address this problem, a strategy consisting of sieving (pre-selecting closely related genomes) followed by alignment and calculation has been proposed.
Here, we first test a published approach, the "genome-wide tetranucleotide frequency correlation coefficient" (TETRA), which is specially tailored for sieving. Our results show that sieving by TETRA requires > 40% completeness for both genomes of a pair to yield > 95% sensitivity, indicating that TETRA is completeness-dependent. Accordingly, we develop a novel algorithm called "fragment tetranucleotide frequency correlation coefficient" (FRAGTE), which uses fragments rather than whole genomes for sieving. Our results show that FRAGTE achieves ~100% sensitivity and high specificity on simulated genomes, real genomes and metagenome-assembled genomes, demonstrating that FRAGTE is completeness-independent. Additionally, FRAGTE passes fewer genomes on to subsequent alignment and calculation, which greatly improves the computational efficiency of the post-sieving process. FRAGTE also reduces the computational cost of the sieving process itself. Consequently, FRAGTE greatly improves the run efficiency of both sieving and post-sieving (subsequent alignment and calculation), and together these gains accelerate genome-wide species delineation.
FRAGTE is a completeness-independent algorithm for sieving. Owing to its high sensitivity, high specificity, much smaller number of sieved genomes and much shorter runtime, FRAGTE will help whole-genome approaches facilitate taxonomic studies in prokaryotes.
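To make the idea of sieving by tetranucleotide frequency correlation concrete, the following is a minimal sketch, not the authors' implementation. It counts the 256 possible tetranucleotides on both strands of two sequences and computes the Pearson correlation of the resulting frequency vectors. It is a simplification: it uses raw frequencies without any further normalization, and it does not reproduce the fragment selection that FRAGTE performs, which the abstract does not specify. All function names and the `cutoff` parameter are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order, so that frequency vectors align.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tetra_frequencies(seq):
    """Relative tetranucleotide frequencies, counted on both strands.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - 3):
            kmer = strand[i:i + 4]
            if kmer in counts:
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid tetranucleotides")
    return [counts[k] / total for k in TETRAMERS]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def tetra_correlation(seq_a, seq_b):
    """Correlation of the tetranucleotide frequency vectors of two sequences."""
    return pearson(tetra_frequencies(seq_a), tetra_frequencies(seq_b))


def passes_sieve(seq_a, seq_b, cutoff):
    """Keep a pair for downstream alignment if its correlation reaches `cutoff`.

    `cutoff` is a hypothetical parameter; no value is given in the text above.
    """
    return tetra_correlation(seq_a, seq_b) >= cutoff
```

In a sieving workflow of the kind described above, only pairs for which `passes_sieve` returns `True` would go on to the expensive whole-genome alignment step. In this sketch, a fragment-based variant would pass fragments rather than complete genome sequences to `tetra_correlation`.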