Han Lei, Zhang Yu, Wan Xiu-Feng, Zhang Tong
Department of Statistics, Rutgers University.
Department of Computer Science and Engineering, Hong Kong University of Science and Technology.
KDD. 2016 Aug;2016:865-874. doi: 10.1145/2939672.2939786.
Recent statistical evidence has shown that a regression model that incorporates interactions among the original covariates/features can significantly improve interpretability for biological data. One major challenge is that the feature space expands exponentially when high-order feature interactions are added to the model. To tackle this huge dimensionality, hierarchical sparse models (HSM) have been developed that enforce sparsity under heredity structures in the interactions among the covariates. However, existing methods consider only pairwise interactions, leaving the discovery of important high-order interactions a non-trivial open problem. In this paper, we propose a generalized hierarchical sparse model (GHSM), a generalization of the HSM models that handles arbitrary-order interactions. The GHSM applies the ℓ1 penalty to all the model coefficients under the constraint that, for any covariate, if none of its associated kth-order interactions contribute to the regression model, then neither do its associated higher-order interactions. The resulting objective function is non-convex, and the main difficulty lies in the coupled variables that appear in the arbitrary-order hierarchical constraints; we devise an efficient optimization algorithm to solve it directly. Specifically, we decouple the variables in the constraints via both the general iterative shrinkage and thresholding (GIST) and the alternating direction method of multipliers (ADMM) methods into three subproblems, each of which is proved to admit an efficient analytical solution. We evaluate the GHSM method on both a synthetic problem and the antigenic site identification problem for influenza virus data, where we expand the feature space up to 5th-order interactions. Empirical results demonstrate the effectiveness and efficiency of the proposed method, and the learned high-order interactions show meaningful synergistic covariate patterns in influenza virus antigenicity.
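To make the dimensionality issue and the hierarchical constraint concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a kth-order interaction is the product of k distinct covariates, and that a term "contributes" when its coefficient is non-zero; the names `count_terms`, `expand_interactions`, and `satisfies_heredity` are hypothetical and introduced only for illustration. The sketch builds the expanded feature matrix and checks whether a given coefficient vector respects the constraint as paraphrased from the abstract. It does not implement the GIST/ADMM solver.

```python
from itertools import combinations
from math import comb

import numpy as np


def count_terms(d, max_order):
    """Number of product terms of order 1..max_order over d covariates."""
    return sum(comb(d, k) for k in range(1, max_order + 1))


def expand_interactions(X, max_order):
    """Return (Z, terms). Each column of Z is the product of the covariates
    indexed by the matching tuple in `terms` (assumed interaction form)."""
    d = X.shape[1]
    terms = [t for k in range(1, max_order + 1)
             for t in combinations(range(d), k)]
    Z = np.column_stack([np.prod(X[:, list(t)], axis=1) for t in terms])
    return Z, terms


def satisfies_heredity(coef, terms, tol=1e-12):
    """Check the constraint as paraphrased from the abstract: for every
    covariate j, once all order-k terms containing j have zero coefficients,
    every higher-order term containing j must also be zero."""
    d = 1 + max(max(t) for t in terms)
    max_order = max(len(t) for t in terms)
    for j in range(d):
        dead = False  # True once some order has no active term containing j
        for k in range(1, max_order + 1):
            active = any(abs(w) > tol for w, t in zip(coef, terms)
                         if len(t) == k and j in t)
            if dead and active:
                return False
            if not active:
                dead = True
    return True


if __name__ == "__main__":
    X = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
    Z, terms = expand_interactions(X, max_order=2)
    print(terms)                  # index tuples, main effects first
    print(Z.shape)                # (n_samples, count_terms(3, 2))
    print(count_terms(10, 5))     # size of a 5th-order expansion, 10 covariates
    # Coefficients ordered like `terms`; covariate 2 is switched off entirely.
    print(satisfies_heredity([1, 1, 0, 1, 0, 0], terms))
    # Covariate 2 has no main effect but an active pairwise term.
    print(satisfies_heredity([1, 1, 0, 0, 1, 0], terms))
```

Under these assumptions the number of columns is the sum of binomial coefficients C(d, k) for k up to the maximum order, so even 10 covariates expanded to 5th order already yield 637 candidate terms. This growth is what motivates a sparsity penalty combined with a hierarchical constraint.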