Genome-wide prediction, display and refinement of binding sites with information theory-based models.

Authors

Gadiraju Sashidhar, Vyhlidal Carrie A, Leeder J Steven, Rogan Peter K

Affiliations

Laboratory of Human Molecular Genetics, Children's Mercy Hospital and Clinics, School of Medicine, and School of Interdisciplinary Computer Science and Engineering, University of Missouri-Kansas City, Kansas City, MO 64108, USA.

Publication information

BMC Bioinformatics. 2003 Sep 8;4:38. doi: 10.1186/1471-2105-4-38.

DOI: 10.1186/1471-2105-4-38

PMID: 12962546

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC200970/

Abstract

BACKGROUND

We present Delila-Genome, a software system for identification, visualization and analysis of protein binding sites in complete genome sequences. Binding sites are predicted by scanning genomic sequences with information theory-based (or user-defined) weight matrices. Matrices are refined by adding experimentally defined binding sites to published binding sites. Delila-Genome was used to examine the accuracy of individual information contents of binding sites detected with refined matrices as a measure of the strengths of the corresponding protein-nucleic acid interactions. The software can then be used to predict novel sites by rescanning the genome with the refined matrices.
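
The scanning and refinement steps described above can be sketched as follows. This is a minimal illustration, not the Delila-Genome implementation: it assumes equiprobable bases (a 2-bit maximum per position), substitutes a simple pseudocount for a small-sample correction, and scans one strand only. The function names and the toy sites are hypothetical.

```python
import math

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=0.5):
    """Build weights[l][b] = 2 + log2 f(b, l), in bits, from aligned sites.

    The pseudocount is an assumption of this sketch; it keeps unseen bases
    from producing log2(0).
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all aligned sites must have the same length")
    total = len(sites) + 4 * pseudocount
    weights = []
    for l in range(length):
        column = [s[l] for s in sites]
        weights.append({
            b: 2.0 + math.log2((column.count(b) + pseudocount) / total)
            for b in BASES
        })
    return weights

def site_information(weights, window):
    """Individual information of one candidate site: sum of per-position weights."""
    return sum(w[b] for w, b in zip(weights, window))

def scan(sequence, weights, threshold=0.0):
    """Slide the matrix along the sequence; keep windows scoring >= threshold bits."""
    width = len(weights)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing ambiguous bases such as N
        bits = site_information(weights, window)
        if bits >= threshold:
            hits.append((start, window, bits))
    return hits

# Toy data (hypothetical): a small set of published sites, then a refinement
# that adds one experimentally defined site to the model.
known_sites = ["AGGTCA", "AGGTCA", "AGTTCA", "GGGTCA"]
matrix = build_weight_matrix(known_sites)
refined = build_weight_matrix(known_sites + ["AGGTGA"])
hits = scan("TTAGGTCATTCCCAGTTCAGG", matrix, threshold=4.0)
```

In this toy case the refined matrix assigns more information to the newly added site than the original matrix does, which is the effect that refinement is meant to capture.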

RESULTS

Parameters for genome scans are entered through a Java-based GUI with backend scripts in Perl. Multi-processor CPU load-sharing minimized the average response time for scans of different chromosomes. Scans of human genome assemblies required 4-6 hours for transcription factor binding sites and 10-19 hours for splice sites, respectively, on 24- and 3-node Mosix and Beowulf clusters. Individual binding sites are displayed either as high-resolution sequence walkers or in low-resolution custom tracks in the UCSC genome browser. For large datasets, we applied a data reduction strategy that limited displays of binding sites exceeding a threshold information content to specific chromosomal regions within or adjacent to genes. An HTML document is produced listing binding sites ranked by binding site strength or chromosomal location, with hyperlinks to the UCSC custom track, other annotation databases and binding site sequences. Post-genome scan tools parse binding site annotations of selected chromosome intervals and compare the results of genome scans using different weight matrices. Comparisons of multiple genome scans can display binding sites that are unique to each scan and identify sites with significantly altered binding strengths.
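
The data reduction and custom track display described above might look like the sketch below, which keeps only sites above an information threshold that fall within or near a gene and writes them as BED lines under a UCSC `track` header. It is an illustration under stated assumptions, not the tool's actual output code: the function names, the 1000-bp flank, and the linear mapping of bits onto the BED 0-1000 score range are all hypothetical choices.

```python
def overlaps_gene_region(start, end, genes, flank=0):
    """True if the half-open interval [start, end) overlaps any gene +/- flank."""
    return any(start < g_end + flank and end > g_start - flank
               for g_start, g_end in genes)

def to_custom_track(chrom, hits, genes, min_bits, flank=1000,
                    max_bits=20.0, name="binding_sites"):
    """Return custom-track lines for the hits that pass both filters.

    hits:  iterable of (start, end, strand, bits), 0-based half-open (BED style)
    genes: list of (start, end) gene intervals on the same chromosome
    max_bits is an assumed upper bound used only to scale the BED score.
    """
    lines = [f'track name="{name}" '
             f'description="Sites >= {min_bits} bits" useScore=1']
    for start, end, strand, bits in sorted(hits):
        if bits < min_bits:
            continue  # below the information threshold
        if not overlaps_gene_region(start, end, genes, flank):
            continue  # not within or adjacent to a gene
        score = max(0, min(1000, round(1000 * bits / max_bits)))
        lines.append(f"{chrom}\t{start}\t{end}\t{bits:.1f}_bits\t{score}\t{strand}")
    return lines

# Hypothetical example: one gene and three predicted sites.
genes = [(2000, 5000)]
hits = [(1500, 1506, "+", 8.8), (90000, 90006, "-", 12.0), (1600, 1606, "+", 3.0)]
track = to_custom_track("chr1", hits, genes, min_bits=5.0)
```

Only the first site survives here: the second lies far from the gene and the third falls below the threshold.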

CONCLUSIONS

Delila-Genome was used to scan the human genome sequence with information weight matrices of transcription factor binding sites, including PXR/RXRalpha, AHR and NF-kappaB p50/p65, and matrices for RNA binding sites including splice donor, acceptor, and SC35 recognition sites. Comparisons of genome scans with the original and refined PXR/RXRalpha information weight matrices indicate that the refined model more accurately predicts the strengths of known binding sites and is more sensitive for detection of novel binding sites.
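
A comparison of two genome scans, such as one with the original matrix and one with the refined matrix, can be sketched as a set operation over site coordinates. The sketch assumes each scan is a mapping from a hypothetical (chromosome, start, strand) key to information in bits, and it uses a fixed difference in bits as a stand-in for whatever significance criterion the software applies.

```python
def compare_scans(original, refined, min_delta=1.0):
    """Split two scans into unique sites and shared sites with changed strength.

    original, refined: dict mapping (chrom, start, strand) -> bits
    min_delta: assumed cutoff (in bits) for calling a strength change notable
    """
    only_original = {k: v for k, v in original.items() if k not in refined}
    only_refined = {k: v for k, v in refined.items() if k not in original}
    changed = {
        k: (original[k], refined[k])
        for k in original.keys() & refined.keys()
        if abs(refined[k] - original[k]) >= min_delta
    }
    return only_original, only_refined, changed

# Hypothetical scan results for illustration only.
scan_original = {("chr7", 1000, "+"): 9.5, ("chr7", 5000, "-"): 6.0,
                 ("chr1", 200, "+"): 7.2}
scan_refined = {("chr7", 1000, "+"): 12.1, ("chr7", 5000, "-"): 6.3,
                ("chr3", 42, "+"): 8.0}
only_a, only_b, changed = compare_scans(scan_original, scan_refined)
```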

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aa86/200970/ebb27f2bf4f1/1471-2105-4-38-1.jpg

Similar articles

Statistical Viewer: a tool to upload and integrate linkage and association data as plots displayed within the Ensembl genome browser.

BMC Bioinformatics. 2005 Apr 12;6:95. doi: 10.1186/1471-2105-6-95.

Tandem machine learning for the identification of genes regulated by transcription factors.

BMC Bioinformatics. 2005 Aug 22;6:204. doi: 10.1186/1471-2105-6-204.

GeneViTo: visualizing gene-product functional and structural features in genomic datasets.

BMC Bioinformatics. 2003 Oct 31;4:53. doi: 10.1186/1471-2105-4-53.

Virtual Footprint and PRODORIC: an integrative framework for regulon prediction in prokaryotes.

Bioinformatics. 2005 Nov 15;21(22):4187-9. doi: 10.1093/bioinformatics/bti635. Epub 2005 Aug 18.

MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes.

BMC Bioinformatics. 2005 Mar 30;6:79. doi: 10.1186/1471-2105-6-79.

The UCSC genome browser database: update 2007.

Nucleic Acids Res. 2007 Jan;35(Database issue):D668-73. doi: 10.1093/nar/gkl928. Epub 2006 Nov 16.

The UCSC Genome Browser Database: update 2009.

Nucleic Acids Res. 2009 Jan;37(Database issue):D755-61. doi: 10.1093/nar/gkn875. Epub 2008 Nov 7.

JASPAR: an open-access database for eukaryotic transcription factor binding profiles.

Nucleic Acids Res. 2004 Jan 1;32(Database issue):D91-4. doi: 10.1093/nar/gkh012.

Integrating alternative splicing detection into gene prediction.

BMC Bioinformatics. 2005 Feb 10;6:25. doi: 10.1186/1471-2105-6-25.

Cited by

Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest.

BMC Bioinformatics. 2014 Mar 7;15:66. doi: 10.1186/1471-2105-15-66.

Comparing binding site information to binding affinity reveals that Crp/DNA complexes have several distinct binding conformers.

Nucleic Acids Res. 2011 Aug;39(15):6813-24. doi: 10.1093/nar/gkr369. Epub 2011 May 17.

TRII: A Probabilistic Scoring of Drosophila melanogaster Translation Initiation Sites.

EURASIP J Bioinform Syst Biol. 2010;2010(1):814127. doi: 10.1155/2010/814127. Epub 2010 Dec 27.

Tandem machine learning for the identification of genes regulated by transcription factors.

BMC Bioinformatics. 2005 Aug 22;6:204. doi: 10.1186/1471-2105-6-204.

Bipartite pattern discovery by entropy minimization-based multiple local alignment.

Nucleic Acids Res. 2004 Sep 23;32(17):4979-91. doi: 10.1093/nar/gkh825. Print 2004.

MPSS profiling of human embryonic stem cells.

BMC Dev Biol. 2004 Aug 10;4:10. doi: 10.1186/1471-213X-4-10.
