Sakhanenko Nikita A, Kunert-Graf James, Galas David J
Pacific Northwest Research Institute, Seattle, Washington.
J Comput Biol. 2017 Dec;24(12):1153-1178. doi: 10.1089/cmb.2017.0143. Epub 2017 Oct 13.
The complex of central problems in data analysis consists of three components: (1) detecting the dependence of variables using quantitative measures, (2) defining the significance of these dependence measures, and (3) inferring the functional relationships among dependent variables. We have argued previously that an information theory approach allows the detection problem to be separated from the problem of inferring functional form. Here we address the third component, inferring functional forms based on the information encoded in the functions. We present a direct method for classifying the functional forms of discrete functions of three variables represented in data sets. Discrete variables are frequently encountered in data analysis, both as inherently categorical variables and as the result of binning continuous numerical variables into discrete alphabets of values. The fundamental question of how much information is contained in a given function is answered for these discrete functions, and their surprisingly complex relationships are illustrated. The all-important effect of noise on the inference of function classes is found to be highly heterogeneous and reveals some unexpected patterns. We apply this classification approach to an important area of biological data analysis: the inference of genetic interactions. Genetic analysis provides a rich source of real and complex biological data analysis problems, and our general methods provide an analytical basis and tools for characterizing genetic problems and for analyzing genetic data. We illustrate the functional description and the classes of a number of common genetic interaction modes and also show how different modes vary widely in their sensitivity to noise.
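As an informal illustration of the question "how much information is contained in a given function", the sketch below computes a few entropy-based quantities for a discrete function. It is not the authors' classification method. It assumes that one variable Z is a deterministic function of two inputs X and Y, and that the inputs are uniformly distributed over a small alphabet; the names `entropy` and `info_profile` are hypothetical and introduced only for this example.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def info_profile(f, alphabet=(0, 1)):
    """Information measures for Z = f(X, Y), assuming uniform inputs X and Y.

    Every (x, y) pair is listed exactly once, so the empirical distribution
    over the rows equals the uniform-input distribution.
    """
    rows = [(x, y, f(x, y)) for x, y in product(alphabet, repeat=2)]

    def h(*idx):
        # Joint entropy of the selected columns (0 = X, 1 = Y, 2 = Z).
        return entropy([tuple(r[i] for i in idx) for r in rows])

    return {
        "H(Z)": h(2),
        "I(X;Z)": h(0) + h(2) - h(0, 2),        # what X alone says about Z
        "I(Y;Z)": h(1) + h(2) - h(1, 2),        # what Y alone says about Z
        "I(X,Y;Z)": h(0, 1) + h(2) - h(0, 1, 2),  # what X and Y jointly say
    }


xor_profile = info_profile(lambda x, y: x ^ y)
and_profile = info_profile(lambda x, y: x & y)

for name, profile in (("XOR", xor_profile), ("AND", and_profile)):
    print(name, {k: round(v, 4) for k, v in profile.items()})
```

Under these assumptions, XOR shows zero pairwise information with either input while the two inputs jointly determine one full bit, whereas AND shows nonzero pairwise information (about 0.311 bits per input). Functions with the same alphabet can therefore have quite different information profiles, which is the kind of distinction the abstract refers to.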