Alam Md Nafis Ul, Román-Palacios Cristian, Copetti Dario, Wing Rod A
Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ, USA.
Plant Biotechnology Laboratory, Department of Biochemistry and Molecular Biology, University of Dhaka, Dhaka, Bangladesh.
BMC Biol. 2025 Jul 28;23(1):224. doi: 10.1186/s12915-025-02328-2.
Universal single-copy orthologs are the most conserved components of genomes. Although they are routinely used for studying evolutionary histories and assessing new assemblies, current methods do not incorporate information from available genomic data.
Here, we first determine the influence of evolutionary history on universal gene content. Across 11,098 genomes of plants, fungi, and animals comprising 2606 taxonomic groups, 215 groups vary significantly from their respective lineages in BUSCO (Benchmarking Universal Single-Copy Orthologs) completeness. Additionally, 169 groups display an elevated complement of duplicated orthologs, likely from ancestral whole-genome duplication events. Second, we investigate the extent of taxonomic congruence in broad BUSCO-derived phylogenies. For 275 suitable families out of 543 tested, sites evolving at higher rates produce phylogenies that are up to 23.84% more taxonomically concordant and at least 46.15% less terminally variable than those from lower-rate sites. We find that BUSCO concatenated and coalescent trees have comparable accuracy and conclude that higher-rate sites from concatenated alignments produce the most congruent and least variable phylogenies. Finally, we show that undetected, yet pervasive, BUSCO gene loss events lead to misrepresentations of assembly quality. To overcome this, we filter a Curated set of BUSCOs (CUSCOs) that yields up to 6.99% fewer false positives than the standard search, and we introduce novel methods for comparing assemblies using gene synteny.
Overall, we highlight the importance of considering evolutionary histories during assembly evaluations and release the phyca software toolkit that reconstructs consistent phylogenies and offers more precise assembly assessments.
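As an illustration of the completeness and duplication measures mentioned in the abstract, the following minimal sketch computes both from the four standard BUSCO result categories (single-copy, duplicated, fragmented, missing). It is not taken from the phyca toolkit; the class name `BuscoCounts` and its fields are hypothetical, and it assumes the usual convention that completeness is the share of single-copy plus duplicated orthologs among all orthologs searched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoCounts:
    """Hypothetical container for the four BUSCO result categories."""
    single: int
    duplicated: int
    fragmented: int
    missing: int

    @property
    def total(self) -> int:
        # Number of orthologs searched in the lineage dataset.
        return self.single + self.duplicated + self.fragmented + self.missing

    def _percent(self, count: int) -> float:
        if self.total == 0:
            raise ValueError("no orthologs were searched")
        return 100.0 * count / self.total

    @property
    def completeness(self) -> float:
        # Complete = single-copy + duplicated (assumed convention).
        return self._percent(self.single + self.duplicated)

    @property
    def duplication(self) -> float:
        # Share of searched orthologs found in more than one copy.
        return self._percent(self.duplicated)


# Made-up counts for illustration only, not data from the study.
example = BuscoCounts(single=900, duplicated=50, fragmented=30, missing=20)
print(f"complete: {example.completeness:.1f}%, duplicated: {example.duplication:.1f}%")
```

Comparing these percentages between a taxonomic group and the rest of its lineage is one simple way to look for the kind of variation the abstract reports. The statistical test the authors applied is not specified here.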