An algorithm for finding signals of unknown length in DNA sequences.

Author information

Pavesi G, Mauri G, Pesole G

Affiliation

Department of Computer Science, Systems and Communication, University of Milan-Bicocca, Via Bicocca degli Arcimboldi 8, Milan, I-20126, Italy.

Publication information

Bioinformatics. 2001;17 Suppl 1:S207-14. doi: 10.1093/bioinformatics/17.suppl_1.s207.

DOI:10.1093/bioinformatics/17.suppl_1.s207

PMID:11473011

Abstract

Pattern discovery in unaligned DNA sequences is a challenging problem in both computer science and molecular biology. Several different methods and techniques have been proposed so far, but in most cases signals in DNA sequences are very complicated and elude detection. Exact exhaustive methods can solve the problem only for short signals with a limited number of mutations. In this work, we extend exhaustive enumeration to longer patterns. In more detail, the basic version of the algorithm presented in this paper, given as input a set of sequences and an error ratio ε < 1, finds all patterns that occur in at least q sequences of the set with at most εm mutations, where m is the length of the pattern. The only restriction is imposed on the location of mutations along the signal: a valid occurrence of a pattern can present at most [εi] mismatches in the first i nucleotides, for each prefix length i ≤ m. However, we show how the algorithm can also be used when no assumption can be made about the position of mutations. In this case, it is also possible to estimate the probability of finding a signal as a function of the signal length, the error ratio, and the input parameters. Finally, we discuss some significance measures that can be used to sort the patterns output by the algorithm.
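The occurrence condition described in the abstract can be illustrated with a short sketch. This is not the authors' implementation (the abstract does not describe the enumeration itself); it only checks whether a given window counts as a valid occurrence of a given pattern. It assumes that the bracket notation [εi] denotes the floor of εi, that a mutation is a single-nucleotide mismatch, and that the function names `is_valid_occurrence` and `count_supporting_sequences` are hypothetical.

```python
from math import floor


def is_valid_occurrence(pattern, window, eps):
    """Check the prefix-constrained mismatch condition.

    Assumption: [eps * i] is read as floor(eps * i). The window is valid
    if, for every prefix length i, the first i positions contain at most
    floor(eps * i) mismatches against the pattern.
    Note: with a float eps the product may round slightly; pass a
    fractions.Fraction for exact arithmetic.
    """
    if len(pattern) != len(window):
        return False
    mismatches = 0
    for i, (p, w) in enumerate(zip(pattern, window), start=1):
        if p != w:
            mismatches += 1
        if mismatches > floor(eps * i):
            return False
    return True


def count_supporting_sequences(pattern, sequences, eps):
    """Count the sequences containing at least one valid occurrence."""
    m = len(pattern)
    return sum(
        any(
            is_valid_occurrence(pattern, seq[k:k + m], eps)
            for k in range(len(seq) - m + 1)
        )
        for seq in sequences
    )
```

Under these assumptions, a pattern would be reported when `count_supporting_sequences(pattern, sequences, eps)` is at least q. With ε = 0.25 and the pattern `ACGTACGT`, a window with its single mismatch in the first position is rejected even though the total stays below εm = 2, because no mismatch is allowed in the first three nucleotides. This is the positional restriction the abstract refers to.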

Similar articles

SIMD parallelization of the WORDUP algorithm for detecting statistically significant patterns in DNA sequences.

Comput Appl Biosci. 1993 Dec;9(6):701-7. doi: 10.1093/bioinformatics/9.6.701.

An improved heuristic algorithm for finding motif signals in DNA sequences.

IEEE/ACM Trans Comput Biol Bioinform. 2011 Jul-Aug;8(4):959-75. doi: 10.1109/TCBB.2010.92.

Finding composite regulatory patterns in DNA sequences.

Bioinformatics. 2002;18 Suppl 1:S354-63. doi: 10.1093/bioinformatics/18.suppl_1.s354.

An O(N²) algorithm for discovering optimal Boolean pattern pairs.

IEEE/ACM Trans Comput Biol Bioinform. 2004 Oct-Dec;1(4):159-70. doi: 10.1109/TCBB.2004.36.

Combinatorial approaches to finding subtle signals in DNA sequences.

Proc Int Conf Intell Syst Mol Biol. 2000;8:269-78.

FASTPAT: a fast and efficient algorithm for string searching in DNA sequences.

Comput Appl Biosci. 1993 Oct;9(5):541-5. doi: 10.1093/bioinformatics/9.5.541.

A hybrid method for the exact planted (l, d) motif finding problem and its parallelization.

BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S10. doi: 10.1186/1471-2105-13-S17-S10. Epub 2012 Dec 13.

Using suffix tree to discover complex repetitive patterns in DNA sequences.

Conf Proc IEEE Eng Med Biol Soc. 2006;2006:3474-7. doi: 10.1109/IEMBS.2006.260445.

Identification of functional elements in unaligned nucleic acid sequences by a novel tuple search algorithm.

Comput Appl Biosci. 1996 Feb;12(1):71-80. doi: 10.1093/bioinformatics/12.1.71.

Cited by

Peak Scores Significantly Depend on the Relationships between Contextual Signals in ChIP-Seq Peaks.

Int J Mol Sci. 2024 Jan 13;25(2):1011. doi: 10.3390/ijms25021011.

A survey on algorithms to characterize transcription factor binding sites.

Brief Bioinform. 2023 May 19;24(3). doi: 10.1093/bib/bbad156.

Deciphering Macromolecular Interactions Involved in Abiotic Stress Signaling: A Review of Bioinformatics Analysis.

Methods Mol Biol. 2023;2642:257-294. doi: 10.1007/978-1-0716-3044-0_15.

A survey on deep learning in DNA/RNA motif mining.

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa229.

Transcription factor expression defines subclasses of developing projection neurons highly similar to single-cell RNA-seq subtypes.

Proc Natl Acad Sci U S A. 2020 Oct 6;117(40):25074-25084. doi: 10.1073/pnas.2008013117. Epub 2020 Sep 18.

Hepatocyte nuclear factor-1β regulates Wnt signaling through genome-wide competition with β-catenin/lymphoid enhancer binding factor.

Proc Natl Acad Sci U S A. 2019 Nov 26;116(48):24133-24142. doi: 10.1073/pnas.1909452116. Epub 2019 Nov 11.

Review of Different Sequence Motif Finding Algorithms.

Avicenna J Med Biotechnol. 2019 Apr-Jun;11(2):130-148.

SArKS: de novo discovery of gene expression regulatory motif sites and domains by suffix array kernel smoothing.

Bioinformatics. 2019 Oct 15;35(20):3944-3952. doi: 10.1093/bioinformatics/btz198.

Shared cis-regulatory architecture identified across defense response genes is associated with broad-spectrum quantitative resistance in rice.

Sci Rep. 2019 Feb 7;9(1):1536. doi: 10.1038/s41598-018-38195-x.

Post-translational modification localizes MYC to the nuclear pore basket to regulate a subset of target genes involved in cellular responses to environmental signals.

Genes Dev. 2018 Nov 1;32(21-22):1398-1419. doi: 10.1101/gad.314377.118. Epub 2018 Oct 26.
