A new statistic for efficient detection of repetitive sequences.

Author information

Department of Automation, MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, BNRist, Tsinghua University, Beijing 100084, China.

Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.

Publication information

Bioinformatics. 2019 Nov 1;35(22):4596-4606. doi: 10.1093/bioinformatics/btz262.

DOI: 10.1093/bioinformatics/btz262

PMID: 30993316

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7963086/

Abstract

MOTIVATION

Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks, but an efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic, D2R, that can efficiently distinguish sequences that contain repetitive regions from those that do not.

RESULTS

Using the statistic, we developed an algorithm with linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats (CRISPR) regions in bacterial genomic or metagenomic sequences. Simulation and real-data experiments show that the method works well on both assembled sequences and unassembled short reads.
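
The abstract does not give the definition of D2R, so the following is only a minimal sketch of the general idea behind D2-type statistics, not the authors' method. The classic D2 statistic sums, over all k-mers, the product of the k-mer's counts in two sequences. As an illustration, the hypothetical `self_match_pairs` function applies the same counting idea to a single sequence by counting pairs of positions that share a k-mer; this score is an assumption made here for illustration and should not be read as D2R. The function names and the choice of `k` are likewise illustrative.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer of seq in a single pass."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Classic D2: sum over k-mers of (count in A) * (count in B)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers that never occur in seq_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())


def self_match_pairs(seq, k):
    """Hypothetical repeat score (NOT the paper's D2R):
    number of position pairs in seq that share the same k-mer."""
    return sum(n * (n - 1) // 2 for n in kmer_counts(seq, k).values())


if __name__ == "__main__":
    print(d2("ACGT", "ACGT", 2))
    print(self_match_pairs("ACGTACGT", 4))
    print(self_match_pairs("ACGTTGCA", 4))
```

For a fixed `k`, each function makes one pass over the input and stores at most one table entry per k-mer position, so time and memory grow linearly with sequence length, which is the kind of cost profile the abstract describes.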

AVAILABILITY AND IMPLEMENTATION

The code is available at https://github.com/XuegongLab/D2R_codes under the GPL 3.0 license.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
