Highly accurate assembly polishing with DeepPolisher.

Authors

Mastoras Mira, Asri Mobin, Brambrink Lucas, Hebbar Prajna, Kolesnikov Alexey, Cook Daniel E, Nattestad Maria, Lucas Julian, Won Taylor S, Chang Pi-Chuan, Carroll Andrew, Paten Benedict, Shafin Kishwar

Affiliations

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.

Google Inc, Mountain View, CA, USA.

Publication information

bioRxiv. 2024 Sep 19:2024.09.17.613505. doi: 10.1101/2024.09.17.613505.

DOI:10.1101/2024.09.17.613505

PMID:39345401

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11429912/

Abstract

Accurate genome assemblies are essential for biological research, but even the highest quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and under-polishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using PacBio HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by half, with a greater than 70% reduction in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted Quality Value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
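
The pairing of a QV improvement of 3.4 with a 54% error reduction follows from the Phred scale, on which QV = -10 * log10(per-base error probability). The short Python sketch below reproduces that arithmetic. It assumes the abstract's QV is Phred-scaled in this standard way; the function names are illustrative and are not part of DeepPolisher or its pipeline.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_reduction_from_qv_gain(delta_qv: float) -> float:
    """Fraction of errors removed when QV rises by delta_qv.

    The ratio of new to old error rate is 10 ** (-delta_qv / 10),
    independent of the starting QV.
    """
    return 1 - 10 ** (-delta_qv / 10)


def qv_gain_from_error_reduction(reduction: float) -> float:
    """QV improvement that corresponds to removing a given fraction of errors."""
    return -10 * math.log10(1 - reduction)


# A QV gain of 3.4 corresponds to roughly a 54% error reduction.
print(f"{error_reduction_from_qv_gain(3.4):.1%}")  # 54.3%

# Halving the errors corresponds to a QV gain of about 3.
print(f"{qv_gain_from_error_reduction(0.5):.2f}")  # 3.01
```

Because the relationship depends only on the difference in QV, the same 3.4-point gain implies the same relative error reduction whatever the starting quality of the assembly.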

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/59ab/11429912/981a1f122fd3/nihpp-2024.09.17.613505v1-f0001.jpg

Also published as: Highly accurate assembly polishing with DeepPolisher. Genome Res. 2025 Jul 1;35(7):1595-1608. doi: 10.1101/gr.280149.124.

