Dvorkin Daniel, Biehs Brian, Kechris Katerina
Computational Bioscience Program, University of Colorado School of Medicine, 12801 E. 17th Ave., Aurora, CO 80045-0511, USA.
Stat Appl Genet Mol Biol. 2013 Aug;12(4):469-87. doi: 10.1515/sagmb-2012-0051.
Making effective use of multiple data sources is a major challenge in modern bioinformatics. Genome-wide data such as measures of transcription factor binding, gene expression, and sequence conservation are used to identify binding regions and genes that are important to major biological processes such as development and disease. These heterogeneous data types differ in biological meaning and statistical distribution, which makes them difficult to use together, yet each provides valuable information about the processes under study. Here we present methods for integrating multiple data sources to gain a more complete picture of gene regulation and expression. Our goal is to identify genes and cis-regulatory regions that play specific biological roles. We describe a graphical mixture model approach for data integration, examine the effect of using different model topologies, and discuss methods for evaluating the effectiveness of the models. Model fitting is computationally efficient and produces results that have clear biological and statistical interpretations. We use the Hedgehog and Dorsal signaling pathways in Drosophila, which are critical in embryonic development, as examples.
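The abstract does not spell out the model itself, so the following is only an illustrative sketch of the general idea of a mixture model for data integration, not the authors' implementation. It assumes the simplest possible topology: a hidden binary class per gene (for example, "pathway target" versus "background"), with each data source modeled as a Gaussian that is conditionally independent of the others given that class, fitted by expectation-maximization. The function name `fit_two_class_mixture`, the three feature columns, and the simulated values are all hypothetical.

```python
import numpy as np


def fit_two_class_mixture(X, n_iter=200, tol=1e-8):
    """EM for a two-class mixture with conditionally independent Gaussian features.

    X is an (n_genes, n_sources) array. Returns the class proportions, the
    per-class means and variances, and the posterior probability that each
    row belongs to class 1. This is an assumed, simplified topology.
    """
    n, _ = X.shape
    # Deterministic start: rows with above-average standardized scores lean to class 1.
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    resp = np.clip((z.mean(axis=1) > 0).astype(float), 0.05, 0.95)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: weighted proportions, means, and variances for each class.
        R = np.column_stack([1.0 - resp, resp])           # (n, 2)
        Nk = R.sum(axis=0)                                # (2,)
        pi = Nk / n
        mu = (R.T @ X) / Nk[:, None]                      # (2, d)
        var = np.maximum((R.T @ X**2) / Nk[:, None] - mu**2, 1e-6)
        # E-step: per-class log joint density, summed over independent features.
        diff2 = (X[:, None, :] - mu[None, :, :]) ** 2     # (n, 2, d)
        log_p = (-0.5 * (np.log(2 * np.pi * var)[None] + diff2 / var[None])).sum(axis=2)
        log_p += np.log(pi)[None, :]
        m = log_p.max(axis=1, keepdims=True)              # log-sum-exp for stability
        log_norm = m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))
        resp = np.exp(log_p[:, 1] - log_norm)
        ll = log_norm.sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, var, resp


# Simulated example: three hypothetical data sources (binding, expression
# change, conservation) for 500 genes, 20% of which are "targets".
rng = np.random.default_rng(0)
is_target = rng.random(500) < 0.2
X = rng.normal(size=(500, 3)) + is_target[:, None] * np.array([3.0, 2.5, 2.0])
pi, mu, var, posterior = fit_two_class_mixture(X)
```

The posterior probabilities give each gene a single, interpretable score that combines all data sources. Changing the model topology, as the abstract describes, would correspond to relaxing the conditional-independence assumption made here.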