Raghu Vineet K, Ge Xiaoyu, Chrysanthis Panos K, Benos Panayiotis V
Department of Computer Science, University of Pittsburgh.
Department of Computational and Systems Biology, University of Pittsburgh.
Proc Int Conf Data Eng. 2017 Apr;2017:1525-1532. doi: 10.1109/ICDE.2017.223. Epub 2017 May 18.
The exponential growth of high-dimensional biological data has rapidly increased the demand for automated approaches to knowledge production. Existing methods follow one of two general approaches: 1) the theory-driven approach, which uses previously accumulated knowledge, and 2) the data-driven approach, which uses only the data to deduce scientific knowledge. Used alone, each approach is biased toward either past or present knowledge, because neither incorporates all of the currently available knowledge when making new discoveries. In this paper, we show how an integrated method can effectively address the high dimensionality of big biological data, which is a major problem for purely data-driven analysis. We realize our approach in a novel two-step analytical workflow: the first step applies a new feature selection paradigm to high-throughput gene expression data, and the second step uses graphical causal modeling to automatically extract causal relationships. Our results on real-world clinical datasets from The Cancer Genome Atlas (TCGA) demonstrate that our method can intelligently select genes for learning effective causal networks.
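The two-step workflow described above can be illustrated with a minimal sketch. The abstract does not specify the feature selection criterion or the causal modeling algorithm, so everything here is an assumption made for illustration: the blended score in `select_genes`, the `prior` scores, the `alpha` weight, the `threshold`, and the use of a first-order partial-correlation skeleton as a simplified stand-in for graphical causal modeling. The data are synthetic, not TCGA.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_genes(expr, target, prior, k, alpha=0.5):
    """Step 1 (hypothetical scoring, not the paper's criterion).

    Blends data-driven relevance (|correlation| with the target) with a
    prior-knowledge score in [0, 1], then keeps the k top-scoring genes.
    """
    score = {
        g: alpha * abs(pearson(v, target)) + (1 - alpha) * prior.get(g, 0.0)
        for g, v in expr.items()
    }
    return sorted(score, key=score.get, reverse=True)[:k]


def partial_corr(rxy, rxz, ryz):
    """First-order partial correlation of X and Y given Z."""
    denom = math.sqrt(max(0.0, (1 - rxz ** 2) * (1 - ryz ** 2)))
    return 0.0 if denom == 0 else (rxy - rxz * ryz) / denom


def skeleton(data, threshold=0.2):
    """Step 2 (simplified stand-in for graphical causal modeling).

    Keeps an undirected edge X-Y only if X and Y are correlated marginally
    and remain correlated given each other single variable.
    """
    names = list(data)
    r = {(a, b): pearson(data[a], data[b]) for a in names for b in names}
    edges = set()
    for a, b in combinations(names, 2):
        if abs(r[a, b]) < threshold:
            continue
        if all(abs(partial_corr(r[a, b], r[a, c], r[b, c])) >= threshold
               for c in names if c not in (a, b)):
            edges.add(frozenset((a, b)))
    return edges


# Synthetic chain g1 -> g2 -> outcome, plus one unrelated gene.
random.seed(0)
n = 2000
g1 = [random.gauss(0, 1) for _ in range(n)]
g2 = [a + 0.5 * random.gauss(0, 1) for a in g1]
outcome = [b + 0.5 * random.gauss(0, 1) for b in g2]
noise = [random.gauss(0, 1) for _ in range(n)]

expr = {"g1": g1, "g2": g2, "noise": noise}
prior = {"g1": 1.0, "g2": 0.5, "noise": 0.0}  # assumed prior-knowledge scores

selected = select_genes(expr, outcome, prior, k=2)
data = {g: expr[g] for g in selected}
data["outcome"] = outcome
edges = skeleton(data)
print(selected, sorted(tuple(sorted(e)) for e in edges))
```

In this toy setting, the unrelated gene is filtered out in step 1, and step 2 drops the direct `g1`-`outcome` edge because the association vanishes once `g2` is conditioned on. A real analysis would replace both steps with the methods developed in the paper and would also orient edges, which this sketch does not attempt.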