Chen Xiaoyu, Hughes Timothy R, Morris Quaid
Banting and Best Department of Medical Research, University of Toronto, Toronto, ON, Canada.
Bioinformatics. 2007 Jul 1;23(13):i72-9. doi: 10.1093/bioinformatics/btm224.
The sequence specificity of DNA-binding proteins is typically represented as a position weight matrix (PWM), in which each base position contributes independently to relative affinity. Assessment of the accuracy and broad applicability of this representation has been limited by the lack of extensive DNA-binding data. However, new microarray techniques, in which preferences for all possible K-mers are measured, enable a broad comparison of both motif representations and methods for motif discovery. Here, we consider the problem of accounting for all of the binding data in such experiments, rather than only the highest-affinity binding data. We introduce RankMotif++, an algorithm designed for finding motifs whenever sequences are associated with a semi-quantitative measure of protein-DNA binding affinity. RankMotif++ learns motif models by maximizing the likelihood of a set of binding preferences under a probabilistic model of how sequence binding affinity translates into binding preference observations. Because RankMotif++ makes few assumptions about the relationship between binding affinity and the semi-quantitative readout, it is applicable to a wide variety of experimental assays of DNA-binding preference.
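The two ideas above, an additive PWM score and a likelihood over pairwise binding preferences, can be illustrated with a minimal Python sketch. It is not the RankMotif++ implementation. The logistic link between score difference and preference probability, the best-window scoring rule, and all function names are assumptions made for illustration only; the abstract does not specify the model's functional form.

```python
import math

def pwm_score(pwm, site):
    """Additive PWM score: each position contributes independently.

    `pwm` is a list of dicts, one per motif position, mapping a base
    to its weight. `site` must have the same length as `pwm`.
    """
    return sum(column[base] for column, base in zip(pwm, site))

def best_window_score(pwm, seq):
    """Score a longer sequence by its best-scoring window (an assumption)."""
    width = len(pwm)
    return max(pwm_score(pwm, seq[i:i + width])
               for i in range(len(seq) - width + 1))

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_log_likelihood(pwm, preferences):
    """Log-likelihood of observed preferences under a logistic link.

    `preferences` is a list of (preferred_seq, other_seq) pairs, each
    meaning the protein was observed to bind `preferred_seq` more strongly.
    The logistic link is a hypothetical stand-in for the paper's model.
    """
    return sum(
        log_sigmoid(best_window_score(pwm, preferred)
                    - best_window_score(pwm, other))
        for preferred, other in preferences
    )
```

A motif learner in this style would adjust the PWM weights to increase `preference_log_likelihood`. Because only the ordering of sequence pairs enters the objective, the raw readout need not be a calibrated affinity, which is the property the abstract attributes to working with semi-quantitative data.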
By several criteria, RankMotif++ predicts binding affinity better than two widely used motif-finding algorithms (MDScan, MatrixREDUCE) or more recently developed algorithms (PREGO, Seed and Wobble), and its performance is comparable to that of a motif model that separately assigns affinities to 8-mers. Our results validate the PWM model and provide an approximation of the precision and recall that can be expected in a genomic scan.
RankMotif++ is available upon request.
Supplementary data are available at Bioinformatics online.