Ensemble attribute profile clustering: discovering and characterizing groups of genes with similar patterns of biological features.

Authors

Semeiks J R, Rizki A, Bissell M J, Mian I S

Affiliation

Life Sciences Division (MS 977-225A), Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.

Publication

BMC Bioinformatics. 2006 Mar 16;7:147. doi: 10.1186/1471-2105-7-147.

Abstract

BACKGROUND

Ensemble attribute profile clustering is a novel, text-based strategy for analyzing a user-defined list of genes and/or proteins. The strategy exploits annotation data present in gene-centered corpora and utilizes ideas from statistical information retrieval to discover and characterize properties shared by subsets of the list. The practical utility of this method is demonstrated by employing it in a retrospective study of two non-overlapping sets of genes defined by a published investigation as markers for normal human breast luminal epithelial cells and myoepithelial cells.

RESULTS

Each genetic locus was characterized using a finite set of biological properties and represented as a vector of features indicating the attributes associated with the locus (a gene attribute profile). In this study, the vector space models for a pre-defined list of genes were constructed from the Gene Ontology (GO) terms and the Conserved Domain Database (CDD) protein domain terms assigned to the loci by the gene-centered corpus LocusLink. This data set of GO- and CDD-based gene attribute profiles, vectors of binary random variables, was used to estimate multiple finite mixture models, and each ensuing model was used to partition the profiles into clusters. The resultant partitionings were combined using a unanimous voting scheme to produce consensus clusters, that is, sets of profiles that co-occurred consistently in the same cluster. Attributes that were important in defining the genes assigned to a consensus cluster were identified. The clusters and their attributes were inspected to ascertain the GO and CDD terms most associated with subsets of genes and, in conjunction with external knowledge such as chromosomal location, were used to gain functional insights into human breast biology. The 52 luminal epithelial cell markers and 89 myoepithelial cell markers are disjoint sets of genes. Ensemble attribute profile clustering-based analysis indicated that both lists contained groups of genes with the functional properties of membrane receptor biology/signal transduction and nucleic acid binding/transcription. A subset of the luminal markers was associated with metabolic and oxidoreductase activities, whereas a subset of the myoepithelial markers was associated with protein hydrolase activity.
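The pipeline described above (binary attribute profiles, several finite mixture models, unanimous voting) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the mixtures were estimated or how many were used, so the sketch assumes a Bernoulli mixture fitted by EM, a fixed number of components, and a handful of random restarts. The function names, gene names, and annotation terms are all invented for illustration.

```python
import math
import random
from collections import defaultdict


def build_profiles(annotations):
    """Map {gene: set of terms} to binary attribute profiles (one row per gene)."""
    genes = sorted(annotations)
    terms = sorted({t for ts in annotations.values() for t in ts})
    profiles = [[1 if t in annotations[g] else 0 for t in terms] for g in genes]
    return genes, terms, profiles


def fit_bernoulli_mixture(profiles, k, seed, n_iter=50):
    """Fit a k-component Bernoulli mixture by EM and return hard cluster labels."""
    rng = random.Random(seed)
    n, d = len(profiles), len(profiles[0])
    weights = [1.0 / k] * k
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each profile.
        resp = []
        for x in profiles:
            logp = [
                math.log(weights[c])
                + sum(math.log(theta[c][j] if x[j] else 1.0 - theta[c][j])
                      for j in range(d))
                for c in range(k)
            ]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate parameters; smoothing keeps them inside (0, 1).
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = (nc + 1e-3) / (n + k * 1e-3)
            for j in range(d):
                on = sum(r[c] * x[j] for r, x in zip(resp, profiles))
                theta[c][j] = (on + 1.0) / (nc + 2.0)
    return [max(range(k), key=lambda c: r[c]) for r in resp]


def consensus_clusters(genes, partitions):
    """Unanimous vote: genes share a consensus cluster only if every
    partition places them in the same cluster."""
    groups = defaultdict(list)
    for i, gene in enumerate(genes):
        groups[tuple(labels[i] for labels in partitions)].append(gene)
    return sorted(groups.values(), key=lambda g: (-len(g), g))


# Toy annotations (invented for illustration; not data from the study).
annotations = {
    "geneA": {"receptor activity", "signal transduction"},
    "geneB": {"receptor activity", "signal transduction"},
    "geneC": {"DNA binding", "transcription"},
    "geneD": {"DNA binding", "transcription"},
    "geneE": {"oxidoreductase activity"},
}
genes, terms, profiles = build_profiles(annotations)
# One partition per random restart; the ensemble size of 5 is an assumption.
partitions = [fit_bernoulli_mixture(profiles, k=3, seed=s) for s in range(5)]
clusters = consensus_clusters(genes, partitions)
for cluster in clusters:
    print(cluster)
```

Grouping genes by their tuple of labels across all partitions is one straightforward way to realize a unanimous vote: two genes land in the same consensus cluster only when no model separates them. Genes with identical profiles, such as geneA and geneB here, always receive identical labels and therefore always stay together.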

CONCLUSION

Given a set of genes and/or proteins associated with a phenomenon, process or system of interest, ensemble attribute profile clustering provides a simple method for collating and synthesizing the annotation data pertaining to them that are present in text-based, gene-centered corpora. The results provide information about properties common and unique to subsets of the list and hence insights into the biology of the problem under investigation.
