University of Alberta School of Public Health, Edmonton, AB, Canada.
Department of Computing Science, University of Alberta, Edmonton, AB, Canada.
BMC Med Inform Decis Mak. 2019 Jun 17;19(1):112. doi: 10.1186/s12911-019-0838-4.
Data mining tools are increasingly used in health research, with the promise of accelerating discoveries. Lift is a standard association metric in the data mining community. However, health researchers struggle to interpret lift, and as a result the dissemination of data mining results can be met with hesitation. The relative risk and odds ratio are the standard association measures in the health domain because of their straightforward interpretation and comparability across populations. We aimed to investigate the relationships between lift and the relative risk and between lift and the odds ratio, and to provide tools to convert lift to the relative risk and odds ratio.
We derived equations linking lift to the relative risk and to the odds ratio. We discussed how lift, relative risk, and odds ratio behave numerically across varying association strengths and exposure prevalence levels. The lift-relative risk relationship was further illustrated using a high-dimensional dataset that examines the association between exposure to airborne pollutants and adverse birth outcomes. We conducted spatial association rule mining using the Kingfisher algorithm, which identified association rules using its built-in lift metric. For each identified rule, we directly estimated the relative risk and odds ratio from the corresponding 2 by 2 table. These values were compared with the corresponding lift values, and relative risks and odds ratios were also computed from lift using the derived equations.
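The conversion described above can be sketched from the standard definitions alone: lift = P(O|E)/P(O), relative risk = P(O|E)/P(O|not E), and the odds ratio as the ratio of the corresponding odds. The following Python is a minimal illustration under those definitions, not the authors' published tool; the function names and the example counts are hypothetical, and the conversion formulas are rearrangements of the definitions rather than equations quoted from the paper.

```python
def measures_from_table(a, b, c, d):
    """Estimate lift, relative risk, and odds ratio from a 2 by 2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    n = a + b + c + d
    p_e = (a + b) / n              # exposure prevalence P(E)
    p_o = (a + c) / n              # outcome prevalence P(O)
    risk_exposed = a / (a + b)     # P(O | E)
    risk_unexposed = c / (c + d)   # P(O | not E)
    return {
        "p_e": p_e,
        "p_o": p_o,
        "lift": risk_exposed / p_o,
        "rr": risk_exposed / risk_unexposed,
        "or": (a * d) / (b * c),
    }


def rr_from_lift(lift, p_e):
    """Relative risk from lift and exposure prevalence.

    Follows from P(O) = P(O|E) P(E) + P(O|not E) (1 - P(E)).
    Requires lift * p_e < 1.
    """
    return lift * (1 - p_e) / (1 - lift * p_e)


def or_from_lift(lift, p_e, p_o):
    """Odds ratio from lift, exposure prevalence, and outcome prevalence."""
    risk_exposed = lift * p_o
    risk_unexposed = p_o * (1 - lift * p_e) / (1 - p_e)
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical counts, chosen only to illustrate the arithmetic.
m = measures_from_table(a=30, b=70, c=20, d=880)
rr_converted = rr_from_lift(m["lift"], m["p_e"])
or_converted = or_from_lift(m["lift"], m["p_e"], m["p_o"])
```

For these counts, the values converted from lift agree with the direct estimates from the table up to floating-point error, which is the kind of check the comparison in the text describes.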
As the exposure-outcome association strengthens, the odds ratio and relative risk move away from 1 faster numerically than lift, i.e. |log(odds ratio)| ≥ |log(relative risk)| ≥ |log(lift)|. In addition, lift is bounded above by the smaller of the inverse probabilities of the outcome and the exposure, i.e. lift ≤ min(1/P(O), 1/P(E)). Unlike the relative risk and odds ratio, lift depends on the exposure prevalence for a fixed outcome. For example, when an exposure A and a less prevalent exposure B have the same relative risk for an outcome, exposure A has a lower lift than B.
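The dependence of lift on exposure prevalence can be made concrete with a small numerical sketch. Writing lift in terms of the relative risk gives lift = RR / (1 + P(E)(RR - 1)), which follows from the definitions of the two measures. The function name and the prevalence values below are hypothetical, chosen only for illustration.

```python
def lift_from_rr(rr, p_e):
    """Lift implied by a relative risk and an exposure prevalence P(E)."""
    return rr / (1 + p_e * (rr - 1))


rr = 2.0                 # same relative risk for both exposures
p_a, p_b = 0.5, 0.1      # exposure A is more prevalent than exposure B

lift_a = lift_from_rr(rr, p_a)
lift_b = lift_from_rr(rr, p_b)

print(round(lift_a, 3))  # 1.333
print(round(lift_b, 3))  # 1.818
```

Both exposures double the risk of the outcome, yet the more prevalent exposure A receives the lower lift. Both lift values are also closer to 1 than the relative risk and stay below 1/P(E).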
Lift, relative risk, and odds ratio are positively correlated and share the same null value of 1. However, lift depends on the exposure prevalence, and thus is not straightforward to interpret or to use for comparing association strength. Tools are provided to obtain the relative risk and odds ratio from lift.