Yin Weiwei, Kissinger Jessica C, Moreno Alberto, Galinski Mary R, Styczynski Mark P
School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0100, USA.
Department of Genetics, Institute of Bioinformatics, Center for Tropical and Emerging Global Diseases, University of Georgia, Athens, GA, USA.
Math Biosci. 2015 Dec;270(Pt B):156-68. doi: 10.1016/j.mbs.2015.06.006. Epub 2015 Jun 17.
High-throughput, genome-scale data present a unique opportunity to link host to pathogen on a molecular level. Forging such connections will help drive the development of mathematical models to better understand and predict both pathogen behavior and the epidemiology of infectious diseases, including malaria. However, the datasets that can aid in identifying these links and models are vast and not amenable to simple, reductionist, and univariate analyses. These datasets require data mining in order to identify the truly important measurements that best describe clinical and molecular observations. Moreover, these datasets typically have relatively few samples due to experimental limitations (particularly for human studies or in vivo animal experiments), making data mining extremely difficult. Here, after first providing a brief overview of common strategies for data reduction and identification of relationships between variables for inclusion in mathematical models, we present a new generalized strategy for performing these data reduction and relationship inference tasks. Our approach emphasizes the importance of robustness when using data to drive model development, particularly when using genome-scale, small-sample in vivo data. We identify the use of appropriate feature reduction combined with data permutations and subsampling strategies as being critical to enable increasingly robust results from network inference using high-dimensional, low-observation data.
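The abstract names three ingredients for robust inference from high-dimensional, low-observation data: feature reduction, data permutations, and subsampling. The following is a minimal sketch of how these could fit together, not the authors' implementation. It assumes a variance filter for feature reduction and absolute Pearson correlation as a stand-in for network inference; all function names, parameters, and the synthetic data are hypothetical.

```python
import numpy as np


def top_variance_features(X, k):
    """Feature reduction (assumed filter): indices of the k highest-variance columns."""
    return np.sort(np.argsort(X.var(axis=0))[-k:])


def abs_corr(X):
    """Absolute Pearson correlation between columns, with the diagonal zeroed."""
    C = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(C, 0.0)
    return np.abs(C)


def permutation_threshold(X, rng, n_perm=30, quantile=0.99):
    """Null cutoff: shuffle each column independently to break associations,
    then take a high quantile of the resulting absolute correlations."""
    null_vals = []
    for _ in range(n_perm):
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
        C = abs_corr(Xp)
        null_vals.append(C[np.triu_indices_from(C, k=1)])
    return np.quantile(np.concatenate(null_vals), quantile)


def edge_stability(X, rng, n_sub=50, frac=0.8, n_perm=30, quantile=0.99):
    """Fraction of random subsamples (drawn without replacement) in which each
    candidate edge exceeds that subsample's permutation-based cutoff."""
    n, p = X.shape
    m = max(3, int(round(frac * n)))
    counts = np.zeros((p, p))
    for _ in range(n_sub):
        idx = rng.choice(n, size=m, replace=False)
        Xs = X[idx]
        thr = permutation_threshold(Xs, rng, n_perm, quantile)
        counts += abs_corr(Xs) > thr
    return counts / n_sub


# Synthetic small-sample data (hypothetical): 20 samples, 200 features,
# with features 0 and 1 strongly coupled and everything else pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
X[:, 0] *= 5.0
X[:, 1] = X[:, 0] + rng.normal(scale=0.5, size=20)

selected = top_variance_features(X, k=10)
stability = edge_stability(X[:, selected], rng)
print("selected features:", selected)
print("stability of first two selected features:", stability[0, 1])
```

Edges that recur across most subsamples are better candidates for inclusion in a downstream model than edges that appear only in a particular subset of samples. The cutoff quantile, subsample fraction, and number of retained features are tuning choices made here for illustration only. The sketch also assumes that no retained feature is constant within a subsample, since a zero-variance column would make the correlation undefined.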