Mastoras Mira, Asri Mobin, Brambrink Lucas, Hebbar Prajna, Kolesnikov Alexey, Cook Daniel E, Nattestad Maria, Lucas Julian, Won Taylor S, Chang Pi-Chuan, Carroll Andrew, Paten Benedict, Shafin Kishwar
UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
Google Inc, Mountain View, CA, USA.
bioRxiv. 2024 Sep 19:2024.09.17.613505. doi: 10.1101/2024.09.17.613505.
Accurate genome assemblies are essential for biological research, but even the highest quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and under-polishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using PacBio HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by half, with a greater than 70% reduction in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted Quality Value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
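The correspondence between a QV gain of 3.4 and a 54% error reduction can be checked with a short calculation. The sketch below assumes QV is the usual Phred-scaled value, QV = -10 * log10(error rate), which the abstract does not state explicitly; the function names are illustrative and are not part of DeepPolisher.

```python
import math


def error_reduction_from_qv_gain(delta_qv: float) -> float:
    """Fraction of errors removed for a given gain in Phred-scaled QV.

    Assumes QV = -10 * log10(error_rate), so a gain of delta_qv scales
    the error rate by 10 ** (-delta_qv / 10).
    """
    return 1.0 - 10.0 ** (-delta_qv / 10.0)


def qv_gain_from_error_reduction(fraction_removed: float) -> float:
    """Inverse mapping: QV gain implied by removing a fraction of errors."""
    return -10.0 * math.log10(1.0 - fraction_removed)


if __name__ == "__main__":
    # Reported average QV improvement of 3.4
    print(f"{error_reduction_from_qv_gain(3.4):.1%}")
    # QV gain equivalent to halving the error count
    print(f"{qv_gain_from_error_reduction(0.5):.2f}")
```

Under this assumption, a gain of 3.4 QV removes roughly 54% of errors, and halving the errors corresponds to a gain of about 3 QV, consistent with the figures reported above.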