RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads.

Affiliations

School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, Changsha, 410083, China.

Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia.

Publication information

BMC Bioinformatics. 2020 Oct 19;21(1):463. doi: 10.1186/s12859-020-03779-w.

DOI:10.1186/s12859-020-03779-w

PMID:33076827

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7574428/

Abstract

BACKGROUND

Repetitive sequences account for a large proportion of eukaryotic genomes. Identifying them plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines and tools obtain repeats by assembling high-frequency k-mers. However, assemblers require a certain level of sequence coverage to produce the desired assemblies. Moreover, assemblers cut reads into shorter k-mers for assembly, which may destroy the structure of repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions of a genome with existing tools.

RESULTS

In this study, we present RepAHR, a new method for de novo repeat identification by assembly of high-frequency reads. First, RepAHR scans next-generation sequencing (NGS) reads to find high-frequency k-mers. Second, RepAHR selects the high-frequency reads from the full set of NGS reads according to rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled into repeats with SPAdes, which is regarded as an outstanding genome assembler for NGS data.
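The first two steps can be illustrated with a minimal Python sketch. It is not the RepAHR implementation: the function names, the count threshold `min_count`, and the selection rule (a read is kept when at least `min_fraction` of its k-mers are high-frequency) are all assumptions made for illustration, since the abstract only says the reads are selected "according to certain rules". Reverse-complement k-mers are not merged, which a real tool would likely do.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (no reverse-complement canonicalization)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def high_frequency_kmers(reads, k, min_count):
    """Return the set of k-mers seen at least min_count times across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def high_frequency_reads(reads, hf_kmers, k, min_fraction):
    """Keep reads whose share of high-frequency k-mers is >= min_fraction.

    Hypothetical selection rule; the abstract does not specify the actual one.
    """
    selected = []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            continue  # read is shorter than k
        hits = sum(1 for kmer in kmers(read, k) if kmer in hf_kmers)
        if hits / total >= min_fraction:
            selected.append(read)
    return selected


# Toy input: three overlapping reads from a repeated "ACGT" region, one unique read.
reads = [
    "ACGTACGTAC",
    "CGTACGTACG",
    "GTACGTACGT",
    "TTGCAAGGCT",
]
hf = high_frequency_kmers(reads, k=4, min_count=3)
repeat_reads = high_frequency_reads(reads, hf, k=4, min_fraction=0.8)
print(repeat_reads)
```

In this toy input, only the three reads drawn from the repeated region are selected. In the workflow described above, the selected reads would then be written to a file and passed to SPAdes for assembly.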

CONCLUSIONS

We tested RepAHR on five data sets. The experimental results show that RepAHR outperforms RepARK and REPdenovo at detecting repeats in terms of N50, reference alignment ratio, coverage ratio of the reference, mask ratio of Repbase, and several other metrics.
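Of the metrics listed, N50 has a standard definition: the largest length L such that sequences of length at least L together cover at least half of the total assembled length. The sketch below computes it from a list of sequence lengths. The function name `n50` and the example lengths are illustrative only and are not taken from the paper's evaluation.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Made-up lengths: total is 280, and the two longest (100 + 80) cover half of it.
example_lengths = [100, 80, 50, 30, 20]
print(n50(example_lengths))
```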

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a41/7574428/729bec3cea69/12859_2020_3779_Fig1_HTML.jpg
