Kasturi Jyotsna, Acharya Raj
Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA.
Bioinformatics. 2005 Feb 15;21(4):423-9. doi: 10.1093/bioinformatics/bti186. Epub 2004 Dec 17.
Genome sequencing projects and high-throughput technologies such as DNA and protein arrays have produced very large amounts of information-rich data. Microarray experimental data are a valuable but limited source for inferring gene regulation mechanisms on a genomic scale. Additional information, such as promoter sequences of genes/DNA binding motifs, gene ontologies, and location data, can increase the statistical significance of the findings when combined with gene expression analysis. This paper introduces a machine learning approach to information fusion for combining heterogeneous genomic data. The algorithm uses an unsupervised joint learning mechanism that identifies clusters of genes using the combined data.
The correlation between gene expression time-series patterns obtained from different experimental conditions and the presence of several distinct and repeated motifs in their upstream sequences is examined here using publicly available yeast cell-cycle data. The results show that the combined learning approach taken here identifies correlated genes effectively. The algorithm provides an automated clustering method but allows the user to specify a priori, using probabilities, the influence of each data type on the final clustering.
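The abstract does not describe the joint learning mechanism in detail, so the following is only a minimal sketch of the general idea of fusing two data types with a user-chosen weight, not the authors' algorithm. It swaps in a plain weighted-distance, average-linkage agglomerative clustering. The function names, the weight `w_expr` (with `1 - w_expr` going to the motif data, so the two weights sum to 1 like probabilities), the distance definitions, and the toy gene records are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # flat profile: treat as uncorrelated
    return cov / (sx * sy)


def expression_distance(x, y):
    """Map a correlation in [-1, 1] to a distance in [0, 1]."""
    return (1.0 - pearson(x, y)) / 2.0


def motif_distance(a, b):
    """Jaccard distance between two sets of motif identifiers."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def fused_distance(g1, g2, w_expr):
    """Weighted mix of both distances; w_expr in [0, 1] is the
    assumed a priori influence of the expression data."""
    d_e = expression_distance(g1["expr"], g2["expr"])
    d_m = motif_distance(g1["motifs"], g2["motifs"])
    return w_expr * d_e + (1.0 - w_expr) * d_m


def cluster(genes, k, w_expr):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[name] for name in genes]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(fused_distance(genes[a], genes[b], w_expr) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > k:
        # Merge the two clusters with the smallest average fused distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy records (not real yeast data): a short expression
# time series and a set of upstream motif labels per gene.
TOY_GENES = {
    "g1": {"expr": [1, 2, 3, 4], "motifs": {"M1", "M2"}},
    "g2": {"expr": [2, 4, 6, 8], "motifs": {"M1", "M2"}},
    "g3": {"expr": [4, 3, 2, 1], "motifs": {"M3"}},
    "g4": {"expr": [8, 6, 4, 2], "motifs": {"M3"}},
}

groups = cluster(TOY_GENES, k=2, w_expr=0.5)
```

Setting `w_expr` close to 1 lets the expression profiles dominate the grouping, while a value close to 0 lets shared motifs dominate. This mirrors, in a simplified way, the user-specified influence of each data type mentioned above.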
Software code is available by request from the first author.