Stanton Jo-Ann L, Macgregor Andrew B, Green David P L
Department of Anatomy and Structural Biology, University of Otago Medical School, Dunedin, New Zealand.
Appl Bioinformatics. 2003;2(3 Suppl):S65-73.
There is considerable interest in the gene expression profiles that underpin the phenotypes of cells and tissues. We have developed Bioperl scripts for mining the National Institutes of Health (NIH) UniGene databases to identify tissue-enriched gene expression. UniGene imports expressed sequence tags (ESTs) from the NIH dbEST database and clusters them by searching for sequence matches. In principle, each UniGene cluster represents the product(s) of a single transcriptional unit in the genome. Because a transcriptional unit can be expressed in a range of cell types, the ESTs in a UniGene cluster reflect these heterogeneous origins. A cluster whose transcript is expressed predominantly or uniquely in one tissue will therefore contain a high proportion of ESTs from that tissue. Our Bioperl scripts parse the NIH UniGene data files as a starting point for an in-house UniGene database. For each UniGene cluster, we then count the ESTs that come from a specified set of dbEST libraries and the total number of ESTs in the cluster. The ratio of the first count to the second gives a measure of enrichment. In this paper, we identify tissue-enriched gene expression in mouse pancreas, mammary gland and heart. Each tissue-enriched expression profile identifies genes that are recognisably characteristic of the respective tissue. Each profile also identifies significant numbers of tissue-enriched UniGene clusters derived from transcriptional units with no known function. These genes may perform important and specialised functions in the tissue in question and may offer targets for drug action.
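The enrichment measure described in the abstract (ESTs from a chosen set of dbEST libraries divided by all ESTs in the cluster) can be illustrated with a minimal sketch. This is not the authors' Bioperl code: it is written in Python, it assumes the UniGene data have already been parsed into a mapping from cluster ID to the library ID of each EST, and the cluster IDs, library names and the `min_ests` cut-off are hypothetical additions for illustration only.

```python
from typing import Dict, Iterable, List, Set, Tuple


def enrichment_ratio(est_libraries: Iterable[str], tissue_libraries: Set[str]) -> float:
    """Fraction of a cluster's ESTs that come from the chosen libraries.

    est_libraries holds one library ID per EST in the cluster.
    """
    libs = list(est_libraries)
    if not libs:
        return 0.0  # an empty cluster has no measurable enrichment
    from_tissue = sum(1 for lib in libs if lib in tissue_libraries)
    return from_tissue / len(libs)


def rank_clusters(
    clusters: Dict[str, List[str]],
    tissue_libraries: Set[str],
    min_ests: int = 1,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[str, float, int]]:
    """Return (cluster_id, ratio, total_ests), most enriched first."""
    ranked = []
    for cluster_id, libs in clusters.items():
        if len(libs) < min_ests:
            continue  # skip clusters too small to judge
        ratio = enrichment_ratio(libs, tissue_libraries)
        ranked.append((cluster_id, ratio, len(libs)))
    # Sort by ratio, then by cluster size, then by ID for a stable order.
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked


if __name__ == "__main__":
    # Hypothetical cluster and library identifiers, for illustration only.
    clusters = {
        "Mm.0001": ["lib_pancreas_1", "lib_pancreas_2", "lib_pancreas_1", "lib_brain_1"],
        "Mm.0002": ["lib_brain_1", "lib_liver_1", "lib_pancreas_1", "lib_heart_1"],
        "Mm.0003": ["lib_pancreas_2"],
    }
    pancreas = {"lib_pancreas_1", "lib_pancreas_2"}
    for cluster_id, ratio, total in rank_clusters(clusters, pancreas):
        print(f"{cluster_id}\t{ratio:.2f}\t{total}")
```

A cluster with a single EST trivially reaches a ratio of 1.0, which is why the sketch exposes a minimum cluster size; whether and how the original scripts filtered small clusters is not stated in the abstract.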