Sinha Saurabh
Center for Studies in Physics and Biology, Box 25, The Rockefeller University, New York, NY 10021, USA.
J Comput Biol. 2003;10(3-4):599-615. doi: 10.1089/10665270360688219.
This paper takes a new view of motif discovery, addressing a common problem in existing motif finders. A motif is treated as a feature of the input promoter regions that leads to a good classifier between these promoters and a set of background promoters. This perspective allows us to adapt existing methods of feature selection, a well-studied topic in machine learning, to motif discovery. We develop a general algorithmic framework that can be specialized to work with a wide variety of motif models, including consensus models with degenerate symbols or mismatches, and composite motifs. A key feature of our algorithm is that it measures overrepresentation while maintaining information about the distribution of motif instances in individual promoters. The assessment of a motif's discriminative power is normalized against chance behaviour by a probabilistic analysis. We apply our framework to two popular motif models and are able to detect several known binding sites in sets of co-regulated genes in yeast.
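The classification view described above can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes a consensus motif with a fixed number of allowed mismatches, keeps a per-promoter occurrence count (so the distribution of instances across promoters is retained), and scores the motif by the error of the best rule of the form "at least t occurrences means input promoter". The function names, the threshold rule, and the raw error count are all illustrative assumptions, and the normalization against chance behaviour mentioned in the abstract is omitted.

```python
def count_occurrences(seq, motif, max_mismatches=0):
    """Count (possibly overlapping) windows of seq that match motif
    with at most max_mismatches mismatching positions."""
    k = len(motif)
    count = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            count += 1
    return count


def best_threshold_classifier(pos_counts, neg_counts):
    """Find the threshold t minimizing the number of misclassified promoters
    under the hypothetical rule: count >= t  =>  'input promoter'."""
    best_t, best_err = None, None
    upper = max(pos_counts + neg_counts, default=0) + 1
    for t in range(1, upper + 1):
        false_neg = sum(1 for c in pos_counts if c < t)   # input promoters missed
        false_pos = sum(1 for c in neg_counts if c >= t)  # background promoters hit
        err = false_neg + false_pos
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t, best_err


def score_motif(motif, positives, negatives, max_mismatches=0):
    """Score a candidate motif by how well its per-promoter counts
    separate input promoters from background promoters."""
    pos_counts = [count_occurrences(s, motif, max_mismatches) for s in positives]
    neg_counts = [count_occurrences(s, motif, max_mismatches) for s in negatives]
    threshold, errors = best_threshold_classifier(pos_counts, neg_counts)
    return {
        "motif": motif,
        "threshold": threshold,
        "errors": errors,
        "pos_counts": pos_counts,
        "neg_counts": neg_counts,
    }


# Toy sequences, invented purely for illustration.
positives = ["ACGTACGTTT", "TTACGTAA", "GGGACGTC"]
negatives = ["TTTTTTTT", "GGGGCCCC", "ACGAACGA"]
result = score_motif("ACGT", positives, negatives)
```

In a fuller treatment, each candidate motif in the search space would be scored this way, and the raw error would be compared with what is expected by chance before candidates are ranked.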