Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Science. 2011 Dec 16;334(6062):1518-24. doi: 10.1126/science.1205438.
Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
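To make the idea of a grid-based dependence score concrete, the following is a minimal Python sketch and not the authors' algorithm. It assumes the commonly cited definition of MIC (the largest mutual information over x-by-y grids, normalized by log2 of the smaller grid dimension, with the grid size capped near n^0.6), none of which is spelled out in the abstract. It also swaps the published grid optimization for plain equal-frequency bins, so it can only under-estimate the true MIC. The names `equal_frequency_bins`, `mutual_information`, and `approx_mic` are hypothetical helpers introduced for illustration.

```python
import math
import random
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Label each value with a bin index so bins hold roughly equal counts.

    Ties are broken by sort order, which is acceptable for a sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels


def mutual_information(bx, by):
    """Mutual information, in bits, of two equally long label sequences."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px = Counter(bx)
    py = Counter(by)
    mi = 0.0
    for (a, b), count in joint.items():
        mi += (count / n) * math.log2(count * n / (px[a] * py[b]))
    return mi


def approx_mic(xs, ys, exponent=0.6):
    """Crude MIC-style score in [0, 1] (up to rounding).

    Assumption: grids are limited to nx * ny <= n ** exponent, and only
    equal-frequency grids are tried instead of optimized partitions.
    """
    n = len(xs)
    budget = n ** exponent
    best = 0.0
    nx = 2
    while nx * 2 <= budget:
        bx = equal_frequency_bins(xs, nx)
        ny = 2
        while nx * ny <= budget:
            by = equal_frequency_bins(ys, ny)
            score = mutual_information(bx, by) / math.log2(min(nx, ny))
            best = max(best, score)
            ny += 1
        nx += 1
    return best


if __name__ == "__main__":
    random.seed(0)
    xs = [random.uniform(-1, 1) for _ in range(500)]
    print("linear   :", approx_mic(xs, xs))
    print("parabola :", approx_mic(xs, [x * x for x in xs]))
    print("noise    :", approx_mic(xs, [random.uniform(-1, 1) for _ in xs]))
```

In this sketch a noiseless linear relationship and a noiseless parabola both score close to 1, while independent noise scores close to 0, which mirrors the abstract's claim that the measure captures functional relationships regardless of their shape.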