Automating genomic data mining via a sequence-based matrix format and associative rule set.

Authors

Wren Jonathan D, Johnson David, Gruenwald Le

Affiliation

Advanced Center for Genome Technology, Department of Botany and Microbiology, 101 David L. Boren Blvd, Rm 2025.

Publication information

BMC Bioinformatics. 2005 Jul 15;6 Suppl 2(Suppl 2):S2. doi: 10.1186/1471-2105-6-S2-S2.

DOI:10.1186/1471-2105-6-S2-S2

PMID:16026599

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC1637034/

Abstract

There is an enormous amount of information encoded in each genome, enough to create living, responsive and adaptive organisms. Raw sequence data alone is not enough to understand function, mechanisms or interactions. Changes in a single base pair can lead to disease, such as sickle-cell anemia, while some large megabase deletions have no apparent phenotypic effect. Genomic features vary in their data types, and annotation of these features is spread across multiple databases. Herein, we develop a method to automate exploration of genomes by iteratively exploring sequence data for correlations and building upon them. First, to integrate and compare different annotation sources, a sequence matrix (SM) is developed to contain position-dependent information. Second, a classification tree is developed for matrix row types, specifying how each data type is to be treated with respect to other data types for analysis purposes. Third, correlative analyses are developed to analyze features of each matrix row in terms of the other rows, guided by the classification tree as to which analyses are appropriate. A prototype was developed and succeeded in detecting coinciding genomic features among genes, exons, repetitive elements and CpG islands.
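The idea of a position-indexed matrix whose rows are compared against one another can be illustrated with a minimal sketch. This is not the authors' prototype: the function names, the toy coordinates, and the observed/expected overlap ratio are all assumptions chosen for illustration. The classification tree is omitted; every row is treated here as a simple binary interval type.

```python
from itertools import combinations


def build_sequence_matrix(length, features):
    """Build a toy sequence matrix: one row per feature type, one column per position.

    `features` maps a row name to a list of half-open (start, end) intervals,
    using 0-based coordinates. All names and coordinates are hypothetical.
    """
    matrix = {}
    for name, intervals in features.items():
        row = [0] * length
        for start, end in intervals:
            # Clamp each interval to the sequence bounds before marking it.
            for pos in range(max(start, 0), min(end, length)):
                row[pos] = 1
        matrix[name] = row
    return matrix


def overlap_enrichment(row_a, row_b):
    """Ratio of observed co-annotated positions to the count expected by chance.

    The expectation assumes the two rows are independent. This statistic is an
    illustrative stand-in, not the correlative analysis used in the paper.
    """
    n = len(row_a)
    observed = sum(a and b for a, b in zip(row_a, row_b))
    expected = sum(row_a) * sum(row_b) / n
    return observed / expected if expected else float("nan")


# Hypothetical annotations on a 100 bp sequence.
features = {
    "gene": [(10, 60)],
    "exon": [(10, 20), (40, 60)],
    "cpg_island": [(5, 15)],
    "repeat": [(70, 90)],
}
sm = build_sequence_matrix(100, features)

# Compare every pair of rows, as an unguided version of the correlative step.
for a, b in combinations(sm, 2):
    print(f"{a} vs {b}: {overlap_enrichment(sm[a], sm[b]):.2f}")
```

A ratio above 1 suggests the two feature types coincide more often than independent placement would predict, and a ratio of 0 means they never share a position. In a fuller version, a row-type classification would decide which comparison is appropriate for each pair of rows instead of applying one statistic to all of them.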

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8b9f/1637034/c1aece4e3f27/1471-2105-6-S2-S2-1.jpg
