RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads.

Author information

School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, Changsha, 410083, China.

Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia.

Publication information

BMC Bioinformatics. 2020 Oct 19;21(1):463. doi: 10.1186/s12859-020-03779-w.

Abstract

BACKGROUND

Repetitive sequences account for a large proportion of eukaryotic genomes. Identifying repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines and tools obtain repeats by assembling high-frequency k-mers. However, assemblers require a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut reads into shorter k-mers for assembly, which may destroy the structure of repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions of a genome with existing tools.

RESULTS

In this study, we present a new method called RepAHR for de novo repeat identification by assembly of high-frequency reads. First, RepAHR scans next-generation sequencing (NGS) reads to find high-frequency k-mers. Second, RepAHR selects the high-frequency reads from the full set of NGS reads according to rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled into repeats using SPAdes, which is considered an outstanding genome assembler for NGS data.
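The first two steps can be illustrated with a minimal Python sketch. The abstract does not state RepAHR's actual selection rules, so the rule used here (keep a read when at least `min_fraction` of its k-mers are high-frequency) is an assumption made for illustration only. The function names, the thresholds, and the toy reads are likewise hypothetical, and the final SPAdes assembly step is not shown.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (no reverse-complement handling, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def high_frequency_kmers(reads, k, min_count):
    """Step 1: count k-mers over all reads and keep those seen at least min_count times."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return {kmer for kmer, count in counts.items() if count >= min_count}


def high_frequency_reads(reads, k, hf_kmers, min_fraction):
    """Step 2 (assumed rule): keep reads whose share of high-frequency k-mers
    is at least min_fraction. RepAHR's real rules may differ."""
    selected = []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            continue  # read is shorter than k, so it has no k-mers
        hits = sum(1 for kmer in kmers(read, k) if kmer in hf_kmers)
        if hits / total >= min_fraction:
            selected.append(read)
    return selected


# Toy data: three reads from a tandem "ACGT" repeat and one unrelated read.
reads = [
    "ACGTACGTAC",
    "CGTACGTACG",
    "GTACGTACGT",
    "TTGACCAGGA",
]
hf = high_frequency_kmers(reads, k=4, min_count=3)
repeat_reads = high_frequency_reads(reads, k=4, hf_kmers=hf, min_fraction=0.8)
```

With these toy inputs, the three repeat-derived reads are retained and the unrelated read is dropped. In a real pipeline the retained reads would then be passed to an assembler, and k-mer counting on full NGS data sets would normally use a dedicated counting tool rather than an in-memory `Counter`.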

CONCLUSIONS

We tested RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo at detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase, and several other metrics.
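Of the metrics listed, N50 has a standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled length. The sketch below shows that standard calculation; the function name and the example lengths are illustrative and do not come from the paper.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
result = n50(contig_lengths)
```

For these lengths the total is 54, and the three longest contigs (10, 9, 8) sum to 27, exactly half, so the N50 is 8.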

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a41/7574428/729bec3cea69/12859_2020_3779_Fig1_HTML.jpg
