Brazma A, Jonassen I, Eidhammer I, Gilbert D
EMBL Outstation-Hinxton, European Bioinformatics Institute, Cambridge, UK.
J Comput Biol. 1998 Summer;5(2):279-305. doi: 10.1089/cmb.1998.5.279.
This paper surveys approaches to the discovery of patterns in biosequences and places these approaches within a formal framework that systematises the types of patterns and the discovery algorithms. Patterns with expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering the patterns that are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis is presented of the ways in which an assessment can be made of the significance of the discovered patterns. It is shown that the problem is related to problems studied in the field of machine learning. The major part of this paper comprises a review of a number of existing methods developed to solve the problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered using the different methods.
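As a rough illustration of what a pattern "in the class of regular languages" can look like in practice, the sketch below converts a simplified PROSITE-style pattern into a regular expression and counts how many sequences in a set it matches. This is not taken from the paper: the function names `prosite_to_regex` and `support`, the toy pattern, and the toy sequences are all hypothetical. Only a small subset of the pattern syntax is handled, and the match count is a crude stand-in for the significance measures the survey analyses.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex.

    Supported elements (an assumed subset): a single residue letter, x (any
    residue), [ABC] (one of), {ABC} (none of), and an optional repeat
    suffix (n) or (n,m). Elements are separated by '-'.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the element body from an optional repeat count.
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            regex = "."                        # wildcard position
        elif core.startswith("[") and core.endswith("]"):
            regex = core                       # allowed residues
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"    # forbidden residues
        elif len(core) == 1 and core.isalpha() and core.isupper():
            regex = core                       # exact residue
        else:
            raise ValueError(f"unsupported element: {element!r}")
        if lo is not None:
            regex += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(regex)
    return "".join(parts)


def support(pattern: str, sequences: list[str]) -> int:
    """Count the sequences that contain at least one match of the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return sum(1 for seq in sequences if regex.search(seq))


# Toy data, invented for illustration only.
toy_pattern = "C-x(2,4)-C-x(3)-[LIVM]-{P}-H"
toy_sequences = [
    "AACKKCAAALGHTT",
    "MMCAAAACGGGVPHKK",
    "CAACTTTIAH",
    "GGGGGGGG",
]

print(prosite_to_regex(toy_pattern))
print(support(toy_pattern, toy_sequences), "of", len(toy_sequences))
```

A discovery method of the kind the survey reviews would search over many candidate patterns and rank them with a proper scoring function. This sketch only evaluates a single, fixed pattern.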