Ye Shengbin, Senftle Thomas P, Li Meng
Department of Statistics, Rice University, Houston, TX 77005.
Department of Chemical and Biomolecular Engineering, Rice University, Houston, TX 77005.
J Am Stat Assoc. 2024;119(545):81-94. doi: 10.1080/01621459.2023.2294527. Epub 2024 Feb 12.
In the emerging field of materials informatics, a fundamental task is to identify physicochemically meaningful descriptors, or materials genes, which are engineered from primary features and a set of elementary algebraic operators through compositions. Standard practice directly analyzes the high-dimensional candidate predictor space in a linear model; statistical analyses are then substantially hampered by the daunting challenge posed by the astronomically large number of correlated predictors with limited sample size. We formulate this problem as variable selection with operator-induced structure (OIS) and propose a new method to achieve unconventional dimension reduction by utilizing the geometry embedded in OIS. Although the model remains linear, we iterate nonparametric variable selection for effective dimension reduction. This enables variable selection based on primary features, leading to a method that is orders of magnitude faster than existing methods, with improved accuracy. To select the nonparametric module, we discuss a desired performance criterion that is uniquely induced by variable selection with OIS; in particular, we propose to employ a Bayesian Additive Regression Trees (BART)-based variable selection method. Numerical studies show superiority of the proposed method, which continues to exhibit robust performance when the input dimension is out of reach of existing methods. Our analysis of single-atom catalysis identifies physical descriptors that explain the binding energy of metal-support pairs with high explanatory power, leading to interpretable insights to guide the prevention of a notorious problem called sintering and aid catalysis design.
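To make the scale of the problem concrete, the following is a minimal sketch of how composing operators over primary features inflates the candidate descriptor space, and how much smaller that space becomes when only a few primary features are retained. It is not the authors' implementation. The operator sets, the helper names `expand_once` and `descriptor_space`, the composition depth, and the synthetic data are all assumptions made for illustration, and the BART-based selection step itself is not shown.

```python
import itertools
import numpy as np

# Hypothetical operator sets, chosen only for illustration.
UNARY = {
    "sq": lambda x: x ** 2,
    "abs": np.abs,
    "exp": np.exp,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "absdiff": lambda a, b: np.abs(a - b),
}


def expand_once(features):
    """Apply every operator once to a dict mapping name -> column."""
    out = dict(features)  # keep the inputs as candidates too
    for op, fn in UNARY.items():
        for name, col in features.items():
            out[f"{op}({name})"] = fn(col)
    for op, fn in BINARY.items():
        pairs = itertools.combinations(features.items(), 2)
        for (n1, c1), (n2, c2) in pairs:
            out[f"{op}({n1},{n2})"] = fn(c1, c2)
    return out


def descriptor_space(X, names, depth):
    """Compose the operators `depth` times over the columns of X."""
    feats = {n: X[:, j] for j, n in enumerate(names)}
    for _ in range(depth):
        feats = expand_once(feats)
    return feats


# Synthetic data: 20 samples, 5 primary features (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
names = [f"x{j}" for j in range(X.shape[1])]

# Candidate space built from all primary features.
full = descriptor_space(X, names, depth=2)

# Candidate space built after keeping only two primary features,
# standing in for the output of a nonparametric selection step.
reduced = descriptor_space(X[:, :2], names[:2], depth=2)

print(len(full), len(reduced))
```

Even with five primary features and two rounds of composition, the full candidate set is far larger than the sample size, while the set built from two retained features is much smaller. This is the motivation the abstract gives for selecting on primary features before constructing descriptors.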