Huang Neng, Nie Fan, Ni Peng, Luo Feng, Gao Xin, Wang Jianxin
School of Computer Science and Engineering, Central South University, Changsha 410083, China.
Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China.
Bioinformatics. 2021 Oct 11;37(19):3120-3127. doi: 10.1093/bioinformatics/btab354.
Oxford Nanopore sequencing, which produces long reads at low cost, has enabled many breakthroughs in genomics studies. However, the large number of errors in Nanopore genome assemblies affects the accuracy of genome analysis. Polishing is a procedure that corrects errors in a genome assembly and can improve the reliability of downstream analysis. However, the performance of existing polishing methods is still not satisfactory.
We developed a novel polishing method, NeuralPolish, which corrects errors in assemblies using alignment matrix construction and orthogonal Bi-GRU networks. In this method, we designed an alignment feature matrix to represent the read-to-assembly alignment. Each row of the matrix represents a read, and each column represents the bases aligned to one position of the contig. In the network architecture, a bi-directional GRU network extracts the sequence information inside each read by processing the alignment matrix row by row. The resulting feature matrix is then processed column by column by a second bi-directional GRU network to calculate the probability distribution. Finally, a CTC decoder generates the polished sequence with a greedy algorithm. We tested on five real datasets with three assembly tools (Wtdbg2, Flye and Canu) and compared the results of different polishing methods, including NeuralPolish, Racon, MarginPolish, HELEN and Medaka. Comprehensive experiments demonstrate that NeuralPolish produces more accurate assemblies with fewer errors than the other polishing methods and can improve the accuracy of assemblies obtained from different assemblers.
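The two non-neural steps described above, laying reads out as a matrix and greedy CTC decoding, can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the NeuralPolish implementation: the function names, the `.`/`-`/`_` symbols, the character-level encoding and the hand-written probability table are all hypothetical, and the two Bi-GRU networks that would produce the probabilities are omitted.

```python
from itertools import groupby

PAD = "."    # assumed marker: the read does not cover this contig position
GAP = "-"    # assumed marker: deletion in the read relative to the contig
BLANK = "_"  # assumed CTC blank symbol
ALPHABET = ["A", "C", "G", "T", BLANK]


def build_alignment_matrix(contig_len, aligned_reads):
    """Lay out reads as rows and contig positions as columns.

    aligned_reads is a list of (start, bases) pairs, where `bases` is
    already aligned column by column to the contig (gaps written as GAP).
    Insertions relative to the contig are ignored in this toy version.
    """
    matrix = []
    for start, bases in aligned_reads:
        row = [PAD] * contig_len
        for offset, base in enumerate(bases):
            pos = start + offset
            if 0 <= pos < contig_len:
                row[pos] = base
        matrix.append(row)
    return matrix


def ctc_greedy_decode(prob_rows, alphabet=ALPHABET):
    """Greedy CTC decoding: take the most probable symbol at each step,
    merge consecutive repeats, then drop blanks."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in prob_rows]
    merged = [symbol for symbol, _ in groupby(best)]
    return "".join(s for s in merged if s != BLANK)


# Toy alignment: three reads against a contig of length 6.
matrix = build_alignment_matrix(6, [(0, "ACGT"), (2, "GTTA"), (1, "C-TT")])
column_3 = [row[3] for row in matrix]  # bases aligned to contig position 3

# Hand-written per-step distributions over ALPHABET (A, C, G, T, blank),
# standing in for the output of the column-wise network.
probs = [
    [0.90, 0.02, 0.02, 0.02, 0.04],  # A
    [0.80, 0.05, 0.05, 0.05, 0.05],  # A again (merged with the previous step)
    [0.10, 0.10, 0.05, 0.05, 0.70],  # blank separates the repeated base
    [0.60, 0.10, 0.10, 0.10, 0.10],  # A
    [0.03, 0.90, 0.03, 0.02, 0.02],  # C
]
polished = ctc_greedy_decode(probs)
```

The blank in the third step is what allows the decoder to emit two consecutive `A` bases; without it, the repeated predictions would be merged into a single base, which matters for homopolymer runs.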
https://github.com/huangnengCSU/NeuralPolish.git.
Supplementary data are available at Bioinformatics online.