Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm.

Author information

Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland.

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Publication information

Bioinformatics. 2020 Jun 1;36(12):3669-3679. doi: 10.1093/bioinformatics/btaa179.

DOI:10.1093/bioinformatics/btaa179

PMID:32167530

Abstract

MOTIVATION

Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is then used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of the base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing, or fixing, errors in the assembly using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can polish an assembly only with reads from a certain sequencing technology, or can polish only a small assembly. Such technology dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms in order to use all available read sets and (ii) split a large genome into small chunks in order to polish it.

RESULTS

We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts.
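
The two classical HMM routines named in steps (ii) and (iii) can be illustrated with a minimal sketch. This is not Apollo's implementation: it uses a generic two-state HMM with fixed parameters rather than a profile HMM built from an assembly, and it computes only the Forward-Backward posteriors (the quantity that training would use to re-estimate parameters) and the Viterbi path. The state names (`ok`, `err`), the symbols (`m` for a read base that matches the assembly, `x` for a mismatch) and all probabilities are hypothetical values chosen for illustration.

```python
# Toy HMM: hidden states label an assembly position as correct ("ok") or
# erroneous ("err"). All names and numbers below are illustrative assumptions.
STATES = ("ok", "err")
START = {"ok": 0.9, "err": 0.1}
TRANS = {"ok": {"ok": 0.9, "err": 0.1}, "err": {"ok": 0.3, "err": 0.7}}
EMIT = {"ok": {"m": 0.9, "x": 0.1}, "err": {"m": 0.2, "x": 0.8}}


def forward_backward(obs, states, start, trans, emit):
    """Return, for each position, the posterior P(state | all observations).

    Plain probabilities are used for readability; long sequences would need
    scaling or log-space arithmetic to avoid underflow.
    """
    n = len(obs)
    # Forward pass: fwd[t][s] = P(obs[0..t], state at t is s)
    fwd = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, n):
        fwd.append({
            s: emit[s][obs[t]] * sum(fwd[t - 1][p] * trans[p][s] for p in states)
            for s in states
        })
    # Backward pass: bwd[t][s] = P(obs[t+1..n-1] | state at t is s)
    bwd = [{s: 1.0 for s in states} for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(
                trans[s][q] * emit[q][obs[t + 1]] * bwd[t + 1][q] for q in states
            )
    total = sum(fwd[n - 1][s] for s in states)  # likelihood of the sequence
    return [{s: fwd[t][s] * bwd[t][s] / total for s in states} for t in range(n)]


def viterbi(obs, states, start, trans, emit):
    """Return the single most probable hidden-state path for obs."""
    best = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s] = best predecessor of state s at position t
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans[p][s])
            best[t][s] = best[t - 1][prev] * trans[prev][s] * emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


observations = list("mxxxm")  # one match, three mismatches, one match
posteriors = forward_backward(observations, STATES, START, TRANS, EMIT)
print(viterbi(observations, STATES, START, TRANS, EMIT))
# ['ok', 'err', 'err', 'err', 'ok']
```

In this toy setting the run of mismatches is decoded as an erroneous region flanked by correct positions. The posteriors give a per-position confidence for the same call, which is the kind of soft evidence a training step would accumulate.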

AVAILABILITY AND IMPLEMENTATION

Source code is available at https://github.com/CMU-SAFARI/Apollo.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
