Zhu Yixin, Watson Corey, Safonova Yana, Pennell Matt, Bankevich Anton
Department of Quantitative and Computational Biology and Biological Sciences, University of Southern California, Los Angeles, CA, United States.
Department of Biochemistry and Molecular Biology, University of Louisville School of Medicine, Louisville, KY, United States.
bioRxiv. 2024 Aug 2:2024.07.19.604360. doi: 10.1101/2024.07.19.604360.
Long-read sequencing technologies have revolutionized genome assembly, producing near-complete chromosome assemblies for numerous organisms, which are invaluable to research in many fields. However, regions with complex repetitive structure remain a challenge for genome assembly algorithms, particularly where heterozygosity is high. Robust and comprehensive solutions for assessing assembly accuracy and completeness in these regions do not exist. In this study, we focus on the assembly of biomedically important antibody-encoding immunoglobulin (IG) loci, which are characterized by complex duplications and repeat structures. High-quality full-length assemblies of these loci are critical for resolving haplotype-level annotations of IG genes, without which functional and evolutionary studies of antibody immunity across vertebrates are not tractable. To address these challenges, we developed a pipeline, "CloseRead", that generates multiple assembly verification metrics for analysis and visualization. These metrics expand upon those of existing quality assessment tools and specifically target complex and highly heterozygous regions. Using CloseRead, we systematically assessed the accuracy and completeness of IG loci in publicly available assemblies of 74 vertebrate species, identifying problematic regions. We also demonstrated that inspecting assembly graphs for problematic regions can both identify the root cause of assembly errors and illuminate solutions for improving erroneous assemblies. For a subset of species, we were able to correct assembly errors through targeted reassembly. Together, our analysis demonstrated the utility of assembly assessment in improving the completeness and accuracy of IG loci across species.