Wijaya Edward, Yiu Siu-Ming, Son Ngo Thanh, Kanagasabai Rajaraman, Sung Wing-Kin
School of Computing, National University of Singapore, Singapore.
Bioinformatics. 2008 Oct 15;24(20):2288-95. doi: 10.1093/bioinformatics/btn420. Epub 2008 Aug 12.
Locating transcription factor binding sites (motifs) is a key step in understanding gene regulation. Based on Tompa's benchmark study, the performance of current de novo motif finders is far from satisfactory (with sensitivity ≤0.222 and precision ≤0.307). The same study also shows that no motif finder performs consistently well over all datasets. Hence, it is not clear which finder one should use for a given dataset. To address this issue, a class of algorithms called ensemble methods has been proposed. Though the existing ensemble methods overall perform better than stand-alone motif finders, the improvement gained is not substantial. Our study reveals that these methods do not fully exploit the information obtained from the results of individual finders, resulting in only a minor improvement in sensitivity and poor precision.
In this article, we identify several key observations on how to utilize the results from individual finders and design a novel ensemble method, MotifVoter, to predict the motifs and binding sites. Evaluations on 186 datasets show that MotifVoter can locate more than 95% of the binding sites found by its component motif finders. In terms of sensitivity and precision, MotifVoter outperforms stand-alone motif finders and ensemble methods significantly on Tompa's benchmark, Escherichia coli, and ChIP-Chip datasets. MotifVoter is available online via a web server with several biologist-friendly features.
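For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how sensitivity and precision can be computed at the nucleotide level, assuming the usual definitions (sensitivity = TP / (TP + FN), precision = TP / (TP + FP)). The function name `sensitivity_precision` and the toy coordinates are illustrative assumptions, not part of MotifVoter or of the benchmark's own tooling.

```python
def sensitivity_precision(predicted, known):
    """Return (sensitivity, precision) for two collections of site positions.

    Both arguments are iterables of hashable positions, e.g. (sequence_id, offset).
    This is an illustrative helper, not code from MotifVoter.
    """
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # positions predicted and truly in a site
    fn = len(known - predicted)   # true site positions that were missed
    fp = len(predicted - known)   # predicted positions outside any true site
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


# Toy example (made-up coordinates): one known 10-bp site and one
# 12-bp prediction that overlaps it by 5 bp.
known_sites = {("seq1", p) for p in range(10, 20)}
predicted_sites = {("seq1", p) for p in range(15, 27)}
sens, prec = sensitivity_precision(predicted_sites, known_sites)
print(f"sensitivity={sens:.3f} precision={prec:.3f}")
```

In this toy case half of the true site is recovered, while fewer than half of the predicted positions are correct, which illustrates how a finder can trade one metric against the other.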