Clustering of diverse genomic data using information fusion.

Author information

Kasturi Jyotsna, Acharya Raj

Affiliation

Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA.

Publication information

Bioinformatics. 2005 Feb 15;21(4):423-9. doi: 10.1093/bioinformatics/bti186. Epub 2004 Dec 17.

DOI:10.1093/bioinformatics/bti186

PMID:15608052

Abstract

MOTIVATION

Genome sequencing projects and high-throughput technologies such as DNA and protein arrays have produced a very large amount of information-rich data. Microarray experimental data are a valuable but limited source for inferring gene regulation mechanisms on a genomic scale. Additional information, such as promoter sequences of genes/DNA binding motifs, gene ontologies, and location data, can increase the statistical significance of the findings when combined with gene expression analysis. This paper introduces a machine learning approach to information fusion for combining heterogeneous genomic data. The algorithm uses an unsupervised joint learning mechanism that identifies clusters of genes using the combined data.

RESULTS

The correlation between gene expression time-series patterns obtained from different experimental conditions and the presence of several distinct and repeated motifs in their upstream sequences is examined here using publicly available yeast cell-cycle data. The results show that the combined learning approach taken here identifies correlated genes effectively. The algorithm provides an automated clustering method but allows the user to specify a priori, using probabilities, the influence of each data type on the final clustering.
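
As an illustration only, the idea of letting a user-chosen weight control how much each data type influences the clustering can be sketched as follows. This is not the authors' algorithm, which the abstract does not specify. The sketch swaps in a plain weighted blend of two pairwise distances (correlation distance on expression profiles, Jaccard distance on motif sets) followed by average-linkage agglomerative clustering. The gene names, motif names, toy values, and all function names are hypothetical.

```python
import math

def pearson_distance(a, b):
    """Return 1 - Pearson correlation (0 for identical shapes, 2 for opposite)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if sd_a == 0 or sd_b == 0:
        return 1.0  # constant profile: treat as uncorrelated
    return 1.0 - cov / (sd_a * sd_b)

def jaccard_distance(a, b):
    """Return 1 - Jaccard similarity between two sets of motif names."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def fused_distances(expr, motifs, w_expr):
    """Blend both distances; w_expr in [0, 1] is the assumed prior weight
    on expression data, and 1 - w_expr is the weight on motif data."""
    if not 0.0 <= w_expr <= 1.0:
        raise ValueError("w_expr must lie in [0, 1]")
    genes = sorted(expr)
    dist = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            d_expr = pearson_distance(expr[g], expr[h]) / 2.0  # rescale to [0, 1]
            d_motif = jaccard_distance(motifs[g], motifs[h])
            dist[(g, h)] = w_expr * d_expr + (1.0 - w_expr) * d_motif
    return dist

def average_linkage(genes, dist, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[g] for g in sorted(genes)]

    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def linkage(c1, c2):
        return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)

# Hypothetical toy data: expression says A~B and C~D, motifs say A~C and B~D.
expr = {"geneA": [1, 2, 3, 4], "geneB": [2, 4, 6, 8.5],
        "geneC": [4, 3, 2, 1], "geneD": [8, 6, 4.5, 2]}
motifs = {"geneA": {"m1", "m2"}, "geneB": {"m3"},
          "geneC": {"m1", "m2"}, "geneD": {"m3"}}

by_expression = average_linkage(expr, fused_distances(expr, motifs, 1.0), 2)
by_motifs = average_linkage(expr, fused_distances(expr, motifs, 0.0), 2)
```

With all weight on expression, the toy genes group by profile shape (geneA with geneB). With all weight on motifs, they group by shared motifs (geneA with geneC). Intermediate weights trade the two sources off against each other.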

AVAILABILITY

Software code is available by request from the first author.

CONTACT

jkasturi@cse.psu.edu.
