Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data.

Author information

Dabrowski Michal, Dojer Norbert, Krystkowiak Izabella, Kaminska Bozena, Wilczynski Bartek

Affiliations

Laboratory of Bioinformatics, Nencki Institute of Experimental Biology, Pasteura 3, Warszawa, 02-093, Poland.

Institute of Informatics, University of Warsaw, Banacha 2, Warszawa, 02-097, Poland.

Publication information

BMC Bioinformatics. 2015 May 1;16:140. doi: 10.1186/s12859-015-0573-5.

Abstract

BACKGROUND

For many years, the binding preferences of transcription factors (TFs) have been described by so-called motifs, usually defined mathematically by position weight matrices (PWMs) or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of motif models in public and commercial databases, a researcher who wants to use them faces many competing methods of identifying potential binding sites in a genome of interest, and there is little published information on which choices are optimal. Given the availability of a large number of different motif models, as well as experimental datasets describing actual TF binding for hundreds of TF-ChIP-seq pairs, we set out to perform a comprehensive analysis of this matter.

RESULTS

We focus on the task of identifying potential transcription factor binding sites in the human genome. First, we provide a comprehensive comparison of the coverage and quality of the models available in different databases, showing that the public databases have comparable TF coverage and better motif performance than commercial databases. Second, we compare different motif scanners, showing that, regardless of the database used, the tools developed by the scientific community outperform the commercial tools. Third, we calculate for each motif a detection threshold that optimizes the accuracy of prediction. Finally, we provide an in-depth comparison of different methods of choosing thresholds for all motifs a priori. Surprisingly, we show that selecting a common false-positive rate gives results that are the least biased by the information content of the motif and are therefore the most uniformly accurate.
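To make the idea of a common false-positive rate concrete, the following is a minimal illustrative sketch and not the authors' pipeline or any of the benchmarked scanners. It assumes a uniform background, scans the forward strand only, and estimates the threshold empirically from random sequences. The function names (`log_odds_pwm`, `scan`, `threshold_at_fpr`), the toy count matrix, and the example sequence are all hypothetical.

```python
import math
import random

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds PWM (uniform background assumed)."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def scan(pwm, sequence):
    """Score every window of the sequence on the forward strand."""
    width = len(pwm)
    return [
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]

def threshold_at_fpr(pwm, fpr, n_background=20000, seed=0):
    """Empirical threshold: roughly a fraction `fpr` of random windows score at or above it."""
    rng = random.Random(seed)
    width = len(pwm)
    scores = sorted(
        (sum(pwm[i][rng.choice(BASES)] for i in range(width))
         for _ in range(n_background)),
        reverse=True,
    )
    k = max(1, int(fpr * n_background))
    return scores[k - 1]

# Hypothetical toy motif: 17 observations of the consensus base, 1 of each other base.
consensus = "TGACTCA"
counts = [{b: (17 if b == c else 1) for b in BASES} for c in consensus]

pwm = log_odds_pwm(counts)
threshold = threshold_at_fpr(pwm, fpr=0.001)

sequence = "ACGTTGACTCAGGCA"
scores = scan(pwm, sequence)
hits = [(pos, score) for pos, score in enumerate(scores) if score >= threshold]
```

Because the threshold is derived from the background score distribution of each motif, the same `fpr` value can be applied to every motif regardless of its length or information content. PWM scores are discrete, so the realized false-positive rate only approximates the requested one.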

CONCLUSION

We provide a guide for researchers working with transcription factor motifs. It is supplemented with detailed results of the analysis and the benchmark datasets at http://bioputer.mimuw.edu.pl/papers/motifs/ .
