Department of Neurobiology, A.I. Virtanen Institute, University of Kuopio, Kuopio, Finland.
IEEE/ACM Trans Comput Biol Bioinform. 2010 Jan-Mar;7(1):37-49. doi: 10.1109/TCBB.2008.56.
Segmentation partitions sequential data into homogeneous regions and plays a central role in data mining. Its applications range from finance to molecular biology, where bioinformatics tasks such as genome data analysis are active application fields. In this paper, we present a novel application of segmentation: locating genomic regions that contain coexpressed genes. We aim at automated discovery of such regions without requiring user-specified parameters. To perform the segmentation within a reasonable time, we use heuristics. Most heuristic segmentation algorithms require a decision on the number of segments. This decision is usually made with asymptotic model selection methods such as the Bayesian information criterion. Such methods rely on simplifying approximations, which can limit their applicability. In this paper, we propose a Bayesian model selection method for choosing the most appropriate result from a heuristic segmentation. Our Bayesian model uses a simple prior over segmentation solutions with different numbers of segments and a modified Dirichlet prior for modeling multinomial data. Using various artificial data sets in our benchmark system, we show that our model selection criterion has the best overall performance. Applying our method to yeast cell-cycle gene expression data reveals potentially active and passive regions of the genome.
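The idea of choosing the number of segments by Bayesian model selection can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify the "simple prior" or the "modified Dirichlet prior", so the sketch assumes a plain symmetric Dirichlet prior with parameter `alpha`, a uniform prior over the number of segments, and a uniform prior over breakpoint placements. It also swaps the paper's heuristic search for exact dynamic programming, which is only practical for short sequences. All function names are hypothetical.

```python
from math import lgamma, log, comb


def log_marginal(counts, alpha=1.0):
    """Log evidence of one segment under a symmetric Dirichlet(alpha) prior.

    `counts` holds the per-category counts of the symbols in the segment.
    The categorical parameters are integrated out analytically.
    """
    k = len(counts)
    n = sum(counts)
    out = lgamma(k * alpha) - lgamma(k * alpha + n)
    for c in counts:
        out += lgamma(alpha + c) - lgamma(alpha)
    return out


def best_segmentations(seq, n_cats, max_k, alpha=1.0):
    """Exact DP: best[k][j] is the highest total log evidence obtainable by
    splitting seq[:j] into exactly k contiguous segments."""
    n = len(seq)
    # prefix[j][c] = number of occurrences of category c in seq[:j]
    prefix = [[0] * n_cats]
    for s in seq:
        row = prefix[-1][:]
        row[s] += 1
        prefix.append(row)

    def seg(i, j):
        counts = [prefix[j][c] - prefix[i][c] for c in range(n_cats)]
        return log_marginal(counts, alpha)

    neg_inf = float("-inf")
    best = [[neg_inf] * (n + 1) for _ in range(max_k + 1)]
    back = [[0] * (n + 1) for _ in range(max_k + 1)]
    best[0][0] = 0.0
    for k in range(1, max_k + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cand = best[k - 1][i] + seg(i, j)
                if cand > best[k][j]:
                    best[k][j] = cand
                    back[k][j] = i
    return best, back


def choose_segmentation(seq, n_cats, max_k, alpha=1.0):
    """Pick the segment count whose best segmentation has the highest
    log evidence plus log prior (assumed prior, see the lead-in)."""
    n = len(seq)
    best, back = best_segmentations(seq, n_cats, max_k, alpha)
    scores = {}
    for k in range(1, max_k + 1):
        # Uniform over k, then uniform over the C(n-1, k-1) breakpoint sets.
        log_prior = -log(max_k) - log(comb(n - 1, k - 1))
        scores[k] = best[k][n] + log_prior
    k_star = max(scores, key=scores.get)
    # Backtrack the segment start positions of the winning segmentation.
    starts = []
    j = n
    for k in range(k_star, 0, -1):
        j = back[k][j]
        starts.append(j)
    breakpoints = sorted(starts)[1:]  # drop the leading 0
    return k_star, breakpoints, scores


if __name__ == "__main__":
    toy = [0] * 30 + [1] * 30 + [2] * 30  # three homogeneous blocks
    k_star, breakpoints, _ = choose_segmentation(toy, n_cats=3, max_k=6)
    print(k_star, breakpoints)
```

On this toy input the sketch selects three segments with breakpoints at positions 30 and 60. The prior over breakpoint placements penalizes extra segments, so no user-chosen segment count is needed beyond the upper bound `max_k`.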