Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, United States.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.
Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad595.
Evaluating gene completeness is critical to measuring the quality of a genome assembly. An incomplete assembly can lead to errors in gene prediction, annotation, and other downstream analyses. Benchmarking Universal Single-Copy Orthologs (BUSCO) is a widely used tool for assessing the completeness of a genome assembly by testing for the presence of a set of single-copy orthologs conserved across a wide range of taxa. However, BUSCO is slow, particularly for large genome assemblies, which makes it cumbersome to apply to a large number of assemblies.
Here, we present compleasm, an efficient tool for assessing the completeness of genome assemblies. Compleasm utilizes the miniprot protein-to-genome aligner and the conserved orthologous genes from BUSCO. For human assemblies, it is 14 times faster than BUSCO and reports a more accurate completeness: 99.6% versus BUSCO's 95.7%, in close agreement with the annotation completeness of 99.5% for T2T-CHM13.
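To make the completeness percentages concrete, the following is a minimal sketch of how such a score can be derived once each conserved ortholog has been classified. It assumes the BUSCO-style convention that an ortholog counts as complete when it is found as a single copy or as duplicated copies, and that fragmented or missing orthologs do not count. The function name, the status labels, and the toy input are illustrative assumptions; they are not compleasm's actual code or output format.

```python
from collections import Counter

# Assumed BUSCO-style convention: "complete" = single-copy + duplicated.
COMPLETE_CLASSES = {"single", "duplicated"}


def completeness_percent(statuses):
    """Return the percentage of orthologs classified as complete.

    `statuses` is an iterable with one label per ortholog, drawn from
    "single", "duplicated", "fragmented", "missing" (hypothetical labels).
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no orthologs to score")
    complete = sum(counts[c] for c in COMPLETE_CLASSES)
    return 100.0 * complete / total


# Toy input with made-up counts, for illustration only (not real results).
toy_statuses = (
    ["single"] * 16 + ["duplicated"] * 2 + ["fragmented"] * 1 + ["missing"] * 1
)
toy_score = completeness_percent(toy_statuses)
print(f"completeness: {toy_score:.1f}%")
```

Under this convention, a difference such as 99.6% versus 95.7% reflects how many orthologs the underlying aligner fails to recover in full, not a change in the reference gene set.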