Sandve Geir Kjetil, Abul Osman, Drabløs Finn
Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway.
BMC Bioinformatics. 2008 Dec 8;9:527. doi: 10.1186/1471-2105-9-527.
Computational discovery of motifs in biomolecular sequences is an established field, with applications in discovering both functional sites in proteins and regulatory sites in DNA. In recent years, attention has grown towards the discovery of composite motifs, which typically occur in cis-regulatory regions of genes.
This paper describes Compo: a discrete approach to composite motif discovery that supports richer modeling of composite motifs and a more realistic background model compared to previous methods. Furthermore, multiple parameter and threshold settings are tested automatically, and the most interesting motifs across settings are selected. This avoids reliance on single hard thresholds, which has been a weakness of previous discrete methods. Comparison of motifs across parameter settings is made possible by the use of p-values as a general significance measure. Compo can either return an ordered list of motifs, ranked according to the general significance measure, or a Pareto front corresponding to a multi-objective evaluation on sensitivity, specificity and spatial clustering.
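The two output modes described above can be illustrated with a minimal sketch. It is not Compo's implementation: the `Candidate` record, the score values, and the motif names are hypothetical, and all three objectives are assumed to be scaled so that higher is better. The sketch shows only the general idea of ranking candidates by p-value versus keeping the non-dominated set over sensitivity, specificity, and spatial clustering.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one candidate composite motif."""
    name: str
    sensitivity: float   # higher is better (assumption)
    specificity: float   # higher is better (assumption)
    clustering: float    # spatial clustering score, higher is better (assumption)
    p_value: float       # general significance measure, lower is better


def objectives(c: Candidate) -> tuple:
    """The three objectives used for the multi-objective comparison."""
    return (c.sensitivity, c.specificity, c.clustering)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every objective and better on one."""
    pairs = list(zip(objectives(a), objectives(b)))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def ranked(cands: List[Candidate]) -> List[Candidate]:
    """Order candidates by p-value, most significant first."""
    return sorted(cands, key=lambda c: c.p_value)


# Made-up candidates, e.g. pooled from several parameter settings.
candidates = [
    Candidate("A", 0.9, 0.6, 0.5, 1e-6),
    Candidate("B", 0.7, 0.8, 0.4, 1e-4),
    Candidate("C", 0.6, 0.5, 0.3, 1e-2),  # dominated by A on all objectives
    Candidate("D", 0.5, 0.7, 0.9, 1e-3),
]

front_names = [c.name for c in pareto_front(candidates)]
ranked_names = [c.name for c in ranked(candidates)]
print("Pareto front:", front_names)
print("Ranked by p-value:", ranked_names)
```

Because a p-value is comparable across parameter settings, candidates from different settings can be pooled into one list before ranking. The Pareto front instead keeps every candidate that represents a distinct trade-off between the three objectives.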
Compo performs competitively against several existing methods on a collection of benchmark data sets. These benchmarks include a recently published, large benchmark suite in which the use of support across sequences allows Compo to correctly identify binding sites even when the relevant position weight matrices (PWMs) are mixed with a large number of noise PWMs. Furthermore, the option of parameter-free running offers high usability, the support for multi-objective evaluation gives a rich view of potential regulators, and the discrete model allows flexibility in the modeling and interpretation of motifs.