Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China.
Brief Bioinform. 2023 Sep 20;24(5). doi: 10.1093/bib/bbad264.
Access to accurate viral genomes is important for downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite the promising results of available polishing tools, there is still room to improve error correction performance and thereby obtain more accurate genome assemblies. Errors, particularly those in coding regions, can hamper analyses such as lineage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are missed by available polishers. We extensively validated HMMPolish on 34 datasets covering four clinically important viruses: HIV-1, influenza A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platform (PacBio or Nanopore). Benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses.
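As a rough illustration of why errors in coding regions matter, the toy sketch below shows how a single-base deletion, an error type common in long reads, shifts the reading frame and changes the downstream amino acids. It is not HMMPolish's algorithm: the sequence and the helper names `translate` and `delete_base` are hypothetical and chosen only for illustration. The assumption is that a protein-level model such as a profile HMM would fit the corrupted translation poorly, which is the kind of signal a protein-aware polisher could exploit.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA in the given frame (0, 1, or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def delete_base(dna: str, pos: int) -> str:
    """Return the sequence with the base at 0-based position `pos` removed."""
    return dna[:pos] + dna[pos + 1:]


# Hypothetical toy coding sequence, not taken from any real virus.
reference = "ATGGCTAAAGGTCTGGAACGTTAA"
erroneous = delete_base(reference, 7)  # simulate a single-base deletion

print(translate(reference))  # MAKGLER*
print(translate(erroneous))  # MAKVWNV
```

After the deletion, every residue downstream of the error differs and the stop codon is lost, even though only one nucleotide changed.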