

An algorithm for finding signals of unknown length in DNA sequences.

Author information

Pavesi G, Mauri G, Pesole G

Affiliation

Department of Computer Science, Systems and Communication, University of Milan-Bicocca, Via Bicocca degli Arcimboldi 8, Milan, I-20126, Italy.

Publication information

Bioinformatics. 2001;17 Suppl 1:S207-14. doi: 10.1093/bioinformatics/17.suppl_1.s207.

Abstract

Pattern discovery in unaligned DNA sequences is a challenging problem in both computer science and molecular biology. Several methods and techniques have been proposed so far, but in most cases signals in DNA sequences are very complicated and elude detection. Exact exhaustive methods can solve the problem only for short signals with a limited number of mutations. In this work, we extend exhaustive enumeration to longer patterns as well. In more detail, the basic version of the algorithm presented in this paper, given as input a set of sequences and an error ratio ε < 1, finds all patterns that occur in at least q sequences of the set with at most εm mutations, where m is the length of the pattern. The only restriction is imposed on the location of mutations along the signal: a valid occurrence of a pattern can present at most [εi] mismatches in its first i nucleotides, for each i up to m. However, we show how the algorithm can also be used when no assumption can be made about the position of mutations. In this case, it is also possible to estimate the probability of finding a signal as a function of the signal length, the error ratio, and the input parameters. Finally, we discuss some significance measures that can be used to sort the patterns output by the algorithm.
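The prefix restriction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which data structure the paper uses, and the brute-force scan below is only an illustration of the constraint itself. It assumes a fixed pattern length m (the paper targets unknown lengths), reads [εi] as the integer part of εi, and uses hypothetical names (`find_patterns`, `ALPHABET`, `extend`). A pattern is grown one nucleotide at a time, and a candidate occurrence is dropped as soon as its mismatch count in the first i positions exceeds the budget for that prefix.

```python
from math import floor

ALPHABET = "ACGT"  # assumed DNA alphabet


def find_patterns(seqs, m, eps, q):
    """Return every length-m pattern that has a valid occurrence in at least
    q of the input sequences. An occurrence is valid if, for each prefix
    length i, it has at most floor(eps * i) mismatches in its first i
    positions. Illustrative brute force, exponential in the worst case."""
    results = []

    def extend(pattern, live):
        # live: (sequence index, start offset, mismatches so far) for each
        # occurrence of `pattern` that still satisfies the prefix budget.
        i = len(pattern)
        if i == m:
            results.append(pattern)
            return
        budget = floor(eps * (i + 1))  # mismatches allowed in the first i+1 positions
        for c in ALPHABET:
            survivors = []
            for s, start, mm in live:
                mm2 = mm + (seqs[s][start + i] != c)
                if mm2 <= budget:
                    survivors.append((s, start, mm2))
            # Prune: only keep extending if at least q distinct sequences remain.
            if len({s for s, _, _ in survivors}) >= q:
                extend(pattern + c, survivors)

    # Every window of length m in every sequence is a candidate occurrence.
    starts = [(s, p, 0) for s, seq in enumerate(seqs)
              for p in range(len(seq) - m + 1)]
    extend("", starts)
    return results
```

For example, `find_patterns(["AACGT", "CCACGA"], 4, 0.25, 2)` returns `['ACGA', 'ACGC', 'ACGG', 'ACGT']`: with ε = 0.25 the first three positions must match exactly and only the fourth may differ. Because `eps * i` is computed in floating point, an exact rational type would be a safer choice for error ratios that are not exactly representable.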

