Institut Pasteur, Unité de Pathogénie Virale, Paris, France.
PLoS One. 2011;6(9):e24085. doi: 10.1371/journal.pone.0024085. Epub 2011 Sep 9.
Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustiveness and general applicability. Here we use a novel non-parametric, non-Euclidean data mining tool, HyperCube®, to exhaustively explore a complex epidemiological malaria data set by searching for overdensity of events in m-dimensional space. Hotspots of overdensity correspond to strings of variables (rules) that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992-2003, aged 1-5 years, having hemoglobin AA, and having had ≤10 previous Plasmodium malariae infections. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCube® rule with rules built from variables identified by traditional statistical methods and by non-parametric regression tree methods. In addition, we tried all possible sub-stratifications of the quantitative variables. No other model with equal or greater representativity gave a higher relative risk. Although three of the four variables in the rule were intuitive, the effect of the number of P. malariae episodes was not. HyperCube® efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks that are not easy to perform using standard data mining methods.
Searching for local overdensity in m-dimensional space, explained by easily interpretable rules, thus appears well suited to generating hypotheses from large data sets and to unraveling the complexity inherent in biological systems.
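To make the notion of a rule concrete, the following minimal Python sketch expresses the four-variable rule reported above as a conjunction of conditions and computes a relative risk on a toy data set. It is an illustration only and not the HyperCube® algorithm: the `Observation` record, its field names, the inclusive age and year bounds, and the definition of relative risk as the episode rate inside the rule divided by the rate in the whole data set are all assumptions, and the eight observations are invented and do not come from the study.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Observation:
    """One hypothetical observation unit (field names are assumptions)."""
    year: int
    age_years: float
    hemoglobin: str            # e.g. "AA" or "AS"
    prior_pm_infections: int   # previous P. malariae infections
    episode: bool              # P. falciparum clinical episode observed?


def matches_rule(obs: Observation) -> bool:
    """The four conditions of the reported rule, with inclusive bounds assumed."""
    return (
        1992 <= obs.year <= 2003
        and 1 <= obs.age_years <= 5
        and obs.hemoglobin == "AA"
        and obs.prior_pm_infections <= 10
    )


def relative_risk(
    observations: Sequence[Observation],
    rule: Callable[[Observation], bool],
) -> float:
    """Episode rate inside the rule divided by the rate in the whole data set.

    This ratio is an assumed definition chosen for illustration.
    """
    inside = [o for o in observations if rule(o)]
    if not inside:
        raise ValueError("no observation matches the rule")
    rate_inside = sum(o.episode for o in inside) / len(inside)
    rate_overall = sum(o.episode for o in observations) / len(observations)
    if rate_overall == 0:
        raise ValueError("no episodes in the data set")
    return rate_inside / rate_overall


# Invented toy data, for illustration only.
toy_data = [
    Observation(1995, 3, "AA", 2, True),
    Observation(1998, 4, "AA", 0, True),
    Observation(1993, 2, "AA", 5, False),
    Observation(2000, 2, "AA", 12, False),   # too many prior infections
    Observation(1996, 30, "AA", 1, False),   # outside the age range
    Observation(2005, 3, "AS", 0, False),    # outside the period, not AA
    Observation(1999, 40, "AS", 3, True),    # outside the age range, not AA
    Observation(2001, 10, "AA", 0, False),   # outside the age range
]

rr = relative_risk(toy_data, matches_rule)
print(f"Relative risk on toy data: {rr:.2f}")
```

An exhaustive search in the spirit described above would evaluate many such candidate rules, including alternative cut-points for the quantitative variables, and retain those with the highest relative risk at a given coverage; that search is not shown here.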