MRC Laboratory of Molecular Biology, Cambridge, United Kingdom.
PLoS Comput Biol. 2010 Dec 2;6(12):e1001020. doi: 10.1371/journal.pcbi.1001020.
Computational methods that attempt to identify instances of cis-regulatory modules (CRMs) in the genome face the challenging problem of searching for potentially interacting transcription factor binding sites while knowledge of the specific interactions involved remains limited. Without a comprehensive comparison of their performance, the reliability and accuracy of these tools remain unclear. Faced with a large number of different tools that address this problem, we summarized and categorized them based on search strategy and input data requirements. Twelve representative methods were chosen and applied to predict CRMs from the Drosophila CRM database REDfly, and across the human ENCODE regions. Our results show that the optimal choice of method varies depending on the species and on the composition of the sequences in question. When discriminating CRMs from non-coding regions, methods that consider evolutionary conservation have stronger predictive power than methods designed to be run on a single genome. Different CRM representations and search strategies rely on different CRM properties, and different methods can complement one another. For example, some favour homotypic clusters of binding sites, while others perform best on short CRMs. Furthermore, most methods appear to be sensitive to the composition and structure of the genome to which they are applied. We analyze the principal features that distinguish the methods that performed well, identify weaknesses leading to poor performance, and provide a guide for users. We also propose key considerations for the development and evaluation of future CRM-prediction methods.