Magdy Mohamed Abdelaziz Barakat Sherif, Sallehuddin Roselina, Yuhaniz Siti Sophiayati, R Khairuddin Raja Farhana, Mahmood Yasir
Computer Science, School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Johor, Malaysia.
Advanced Informatics Department, Razak Faculty of Technology and Informatics, Universiti Teknologi Malaysia, Kuala Lumpur, Kuala Lumpur, Malaysia.
PeerJ Comput Sci. 2023 Jul 13;9:e1180. doi: 10.7717/peerj-cs.1180. eCollection 2023.
Advances in sequencing technology have increased the number of genomes being sequenced. However, obtaining a high-quality genome sequence remains a challenge, because genome assembly must reconstruct the genome from a massive number of short strings (reads) in the presence of repetitive sequences (repeats). Computer algorithms for genome assembly construct the entire genome from reads using one of two approaches. The overlap-based approach concatenates reads based on exact matches between the suffix of one read and the prefix of another (overlapping). The reference-guided approach orders reads by their offsets in a well-known reference genome (read alignment). Repeats introduce ambiguity: the algorithm cannot distinguish reads that originate from different copies of a repeat, which leads to misassembly and reduces the accuracy of either approach. In addition, the massive number of reads poses a major performance challenge for assembly.
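The suffix-prefix matching described above can be illustrated with a minimal sketch. The function names `overlap` and `merge`, the `min_len` threshold, and the toy reads are assumptions made for illustration; they are not taken from any assembler discussed in the review.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` characters exists.
    """
    start = 0
    while True:
        # Look for the first min_len characters of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # The overlap is valid only if the rest of a is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def merge(a: str, b: str, min_len: int = 3) -> str:
    """Concatenate b onto a, collapsing their exact suffix-prefix overlap."""
    return a + b[overlap(a, b, min_len):]


# Hypothetical toy reads that chain unambiguously.
contig = merge(merge("ACGTAC", "GTACGG"), "ACGGTT")

# A repeat-like case: the suffix "CGT" overlaps two different reads equally
# well, so exact matching alone cannot decide which read follows.
ambiguous = (overlap("TTACGT", "CGTAAA"), overlap("TTACGT", "CGTCCC"))
```

In the last two calls both candidate successors share the same overlap length, which is the kind of ambiguity that repeats introduce into overlap-based assembly.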
Repeat identification methods were introduced to address misassembly: repetitive sequences are identified beforehand and stored in a repeat knowledge base, which reduces ambiguity during assembly and thus improves the accuracy of the assembled genome. In addition, hybridizing the two assembly approaches lowers the degree of misassembly with the aid of the reference genome. Assembly performance is optimized through data structure indexing and parallelization. The primary aim and contribution of this article is an extensive review that eases researchers' search for genome assembly studies. The study also highlights the most recent developments and limitations in genome assembly accuracy and performance optimization.
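As a rough illustration of how indexing, prior repeat identification, and parallel read placement fit together, the sketch below builds a k-mer index over a reference, flags k-mers that occur more than once as repeats, and places reads concurrently. All names (`build_kmer_index`, `repeated_kmers`, `place_read`), the value of `k`, and the toy sequences are assumptions for illustration only; the methods surveyed in the review use far more elaborate data structures.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def repeated_kmers(index: dict) -> dict:
    """A toy 'repeat knowledge base': k-mers seen at more than one offset."""
    return {kmer: pos for kmer, pos in index.items() if len(pos) > 1}


def place_read(read: str, reference: str, index: dict, k: int) -> list:
    """Offsets where the read matches the reference exactly, seeded by its first k-mer."""
    return [p for p in index.get(read[:k], []) if reference.startswith(read, p)]


# Hypothetical toy data.
reference = "ACGTTTGACGTACC"
k = 4
index = build_kmer_index(reference, k)
repeats = repeated_kmers(index)

reads = ["ACGT", "ACGTA", "TTTGA"]
# Reads are independent of each other, so they can be placed concurrently.
with ThreadPoolExecutor() as pool:
    placements = list(pool.map(lambda r: place_read(r, reference, index, k), reads))
```

Here the read `ACGT` falls entirely inside the repeated k-mer and maps to two offsets, while the longer read `ACGTA` extends past the repeat and maps to one. The thread pool only shows the shape of the parallel step; in standard CPython a process pool or native code would likely be needed for a real speedup on this CPU-bound work.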
Our findings show that the available repeat identification methods are limited: they detect only repeats of specific lengths and may not perform well when a genome contains various types of repeats. We also found that most hybrid assembly approaches, whether they start with the overlap-based or the reference-guided approach, are limited in handling repetitive sequences because doing so is more computationally costly and time-intensive. Although the hybrid approach was found to outperform the individual assembly approaches, optimizing its performance remains a challenge. Moreover, parallelization of overlapping and read alignment has yet to be fully implemented in the hybrid assembly approach.
We suggest combining multiple repeat identification methods to identify repeats more accurately as an initial step of the hybrid assembly approach, and combining genome indexing with parallelization to better optimize its performance.