Gaps and complex structurally variant loci in phased genome assemblies.

Affiliations

Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA.

Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany.

Publication information

Genome Res. 2023 Apr;33(4):496-510. doi: 10.1101/gr.277334.122. Epub 2023 May 10.

DOI: 10.1101/gr.277334.122

PMID: 37164484

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10234299/

Abstract

There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.
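
The gene counts in the abstract (genes overlapping a gap in at least one haplotype, and genes recurrently affected in five or more haplotypes) come down to an interval-overlap tally across haplotypes. The following is a minimal sketch of that kind of tally, not the authors' pipeline: the function names, the half-open coordinate convention, and the toy genes and gaps are all assumptions made for illustration.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def genes_hit_per_haplotype(genes, gaps_by_haplotype):
    """Map each gene name to the set of haplotypes in which a gap overlaps it."""
    hits = defaultdict(set)
    for hap, gaps in gaps_by_haplotype.items():
        for name, (chrom, g_start, g_end) in genes.items():
            for gap_chrom, s, e in gaps:
                if gap_chrom == chrom and overlaps(g_start, g_end, s, e):
                    hits[name].add(hap)
                    break  # one overlapping gap is enough for this haplotype
    return hits


def summarize(genes, gaps_by_haplotype, recurrent_min=5):
    """Return (genes hit in >= 1 haplotype, genes hit in >= recurrent_min haplotypes)."""
    hits = genes_hit_per_haplotype(genes, gaps_by_haplotype)
    any_hit = sorted(hits)
    recurrent = sorted(g for g, haps in hits.items() if len(haps) >= recurrent_min)
    return any_hit, recurrent


# Hypothetical toy data: gene name -> (chromosome, start, end).
genes = {
    "GENE_A": ("chr1", 100, 200),
    "GENE_B": ("chr1", 500, 600),
    "GENE_C": ("chr2", 100, 200),
}
# Hypothetical gaps per haplotype: list of (chromosome, start, end).
gaps_by_haplotype = {f"hap{i}": [("chr1", 150, 160)] for i in range(1, 6)}
gaps_by_haplotype["hap6"] = [("chr1", 590, 700), ("chr2", 300, 400)]

any_hit, recurrent = summarize(genes, gaps_by_haplotype)
print("hit in at least one haplotype:", any_hit)
print("hit in five or more haplotypes:", recurrent)
```

In this toy input, GENE_A is hit in five haplotypes and so counts as recurrent, GENE_B is hit in one, and GENE_C is never hit. A genome-scale analysis would more likely use an interval tree or a tool such as bedtools than this nested loop.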

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8d5e/10234299/c7e957330046/496f01.jpg
