Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China.
Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.
Bioinformatics. 2019 Nov 1;35(22):4596-4606. doi: 10.1093/bioinformatics/btz262.
Detecting sequences that contain repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for specific types of repeat detection, but an efficient generic method that detects most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic, D2R, that can efficiently discriminate sequences with repetitive regions from those without.
Using the statistic, we developed an algorithm with linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats (CRISPR) regions in bacterial genomic or metagenomic sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads.
The code is available at https://github.com/XuegongLab/D2R_codes under the GPL 3.0 license.
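For readers unfamiliar with the D2 family referred to above, the following is a minimal sketch of the classical D2 statistic only: the sum, over all k-mers, of the product of their counts in two sequences. It is not the D2R statistic, whose definition is not given in this abstract, and it is not taken from the authors' repository. The function names `kmer_counts` and `d2` and the choice of `k` are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    seq = seq.upper()
    # A sequence shorter than k yields an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Classical D2: sum over k-mers w of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Iterate over the smaller table; a Counter returns 0 for missing keys.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(n * counts_b[w] for w, n in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGTTTGA", k=2))
```

Counting k-mers with a hash table takes time and memory proportional to sequence length for fixed k, which is one reason statistics of this kind are attractive for large genomic or read-level data.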
Supplementary data are available at Bioinformatics online.