DISCOVER: a feature-based discriminative method for motif search in complex genomes.

Author information

Fu Wenjie, Ray Pradipta, Xing Eric P

Affiliation

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Publication information

Bioinformatics. 2009 Jun 15;25(12):i321-9. doi: 10.1093/bioinformatics/btp230.

DOI:10.1093/bioinformatics/btp230

PMID:19478006

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2687984/

Abstract

MOTIVATION

Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate 'grammatical organization' of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features.
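To make the modeling idea concrete, the following is a minimal sketch of a linear-chain conditional random field that labels each sequence position as background or motif and returns the posterior probability of each label given all features. It is not the authors' implementation: the feature names (`pwm_score`, `near_coding`), the weights, and the transition scores are hypothetical values chosen only for illustration, and a trained model would learn the weights from annotated sequences.

```python
import math

LABELS = ("background", "motif")

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def node_scores(features, weights):
    """Score every label at every position as a weighted sum of its features."""
    return [
        [sum(weights[label].get(name, 0.0) * value for name, value in feats.items())
         for label in LABELS]
        for feats in features
    ]

def posterior_marginals(features, weights, transitions):
    """Forward-backward on a linear chain: P(label at t | all features)."""
    node = node_scores(features, weights)
    n, k = len(node), len(LABELS)
    fwd = [[0.0] * k for _ in range(n)]
    bwd = [[0.0] * k for _ in range(n)]  # last row stays at log(1) = 0
    fwd[0] = node[0][:]
    for t in range(1, n):
        for j in range(k):
            fwd[t][j] = node[t][j] + logsumexp(
                [fwd[t - 1][i] + transitions[i][j] for i in range(k)])
    for t in range(n - 2, -1, -1):
        for i in range(k):
            bwd[t][i] = logsumexp(
                [transitions[i][j] + node[t + 1][j] + bwd[t + 1][j] for j in range(k)])
    log_z = logsumexp(fwd[-1])  # log partition function
    return [[math.exp(fwd[t][j] + bwd[t][j] - log_z) for j in range(k)]
            for t in range(n)]

# Hypothetical per-position features (illustrative values, not real data).
features = [
    {"pwm_score": -1.0, "near_coding": 0.0},
    {"pwm_score": 2.5, "near_coding": 1.0},
    {"pwm_score": 2.0, "near_coding": 1.0},
    {"pwm_score": -0.5, "near_coding": 0.0},
]
# Hypothetical weights: background scores 0, motif rewards both features.
weights = {"background": {}, "motif": {"pwm_score": 1.0, "near_coding": 1.5}}
# Hypothetical transition scores that favor staying in the same label.
transitions = [[0.5, -0.5], [-0.5, 0.5]]

marginals = posterior_marginals(features, weights, transitions)
for t, row in enumerate(marginals):
    print(t, round(row[1], 3))  # posterior probability of "motif" at position t
```

Because the model is conditioned on the whole feature sequence, heterogeneous signals can be combined by adding entries to the feature dictionaries without having to model how those features are generated.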

RESULTS

This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score.
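For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from counts of predicted binding sites; the function name and the example counts are illustrative assumptions and are not taken from the paper's evaluation.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives   # sites the method reported
    actual = true_positives + false_negatives      # sites that truly exist
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 8 correct predictions, 2 spurious, 8 missed sites.
example = f1_score(8, 2, 8)
print(round(example, 4))
```

A high false positive rate lowers precision, which in turn pulls F1 down even when recall is high; this is why F1 is a natural summary for motif search.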

AVAILABILITY AND IMPLEMENTATION

The code is publicly available at http://www.sailing.cs.cmu.edu/discover.html.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

Alignment and prediction of cis-regulatory modules based on a probabilistic model of evolution.

PLoS Comput Biol. 2009 Mar;5(3):e1000299. doi: 10.1371/journal.pcbi.1000299. Epub 2009 Mar 13.

Identifying cis-regulatory modules by combining comparative and compositional analysis of DNA.

Bioinformatics. 2006 Dec 1;22(23):2858-64. doi: 10.1093/bioinformatics/btl499. Epub 2006 Oct 10.

COPS: detecting co-occurrence and spatial arrangement of transcription factor binding motifs in genome-wide datasets.

PLoS One. 2012;7(12):e52055. doi: 10.1371/journal.pone.0052055. Epub 2012 Dec 18.

De novo prediction of cis-regulatory elements and modules through integrative analysis of a large number of ChIP datasets.

BMC Genomics. 2014 Dec 2;15:1047. doi: 10.1186/1471-2164-15-1047.

An intuitionistic approach to scoring DNA sequences against transcription factor binding site motifs.

BMC Bioinformatics. 2010 Nov 8;11:551. doi: 10.1186/1471-2105-11-551.

CisMiner: genome-wide in-silico cis-regulatory module prediction by fuzzy itemset mining.

PLoS One. 2014 Sep 30;9(9):e108065. doi: 10.1371/journal.pone.0108065. eCollection 2014.

Prediction of similarly acting cis-regulatory modules by subsequence profiling and comparative genomics in Drosophila melanogaster and D.pseudoobscura.

Bioinformatics. 2004 Nov 1;20(16):2738-50. doi: 10.1093/bioinformatics/bth320. Epub 2004 May 14.

MOPAT: a graph-based method to predict recurrent cis-regulatory modules from known motifs.

Nucleic Acids Res. 2008 Aug;36(13):4488-97. doi: 10.1093/nar/gkn407. Epub 2008 Jul 7.

Learning probabilistic models of cis-regulatory modules that represent logical and spatial aspects.

Bioinformatics. 2007 Jan 15;23(2):e156-62. doi: 10.1093/bioinformatics/btl319.

Cited by

Protein-DNA binding in high-resolution.

Crit Rev Biochem Mol Biol. 2015;50(4):269-83. doi: 10.3109/10409238.2015.1051505. Epub 2015 Jun 3.

Discriminative motif optimization based on perceptron training.

Bioinformatics. 2014 Apr 1;30(7):941-8. doi: 10.1093/bioinformatics/btt748. Epub 2013 Dec 24.

CTF: a CRF-based transcription factor binding sites finding system.

BMC Genomics. 2012;13 Suppl 8(Suppl 8):S18. doi: 10.1186/1471-2164-13-S8-S18. Epub 2012 Dec 17.

DNA structural properties in the classification of genomic transcription regulation elements.

Bioinform Biol Insights. 2012;6:155-68. doi: 10.4137/BBI.S9426. Epub 2012 Jul 2.
