Decelle Aurélien, Seoane Beatriz, Rosset Lorenzo
Departamento de Física Teórica, Universidad Complutense de Madrid, 28040 Madrid, Spain and Université Paris-Saclay, CNRS, INRIA Tau team, LISN, 91190 Gif-sur-Yvette, France.
Departamento de Física Teórica, Universidad Complutense de Madrid, 28040 Madrid, Spain.
Phys Rev E. 2023 Jul;108(1-1):014110. doi: 10.1103/PhysRevE.108.014110.
Data sets in the real world are often complex and to some degree hierarchical, with groups and subgroups of data sharing common characteristics at different levels of abstraction. Understanding and uncovering the hidden structure of these data sets is an important task that has many practical applications. To address this challenge, we present a general method for building relational data trees by exploiting the learning dynamics of the restricted Boltzmann machine. Our method is based on the mean-field approach, derived from the Plefka expansion, and developed in the context of disordered systems. It is designed to be easily interpretable. We tested our method on an artificially created hierarchical data set and on three different real-world data sets (images of digits, mutations in the human genome, and a homologous family of proteins). The method is able to automatically identify the hierarchical structure of the data. This could be useful in the study of homologous protein sequences, where the relationships between proteins are critical for understanding their function and evolution.
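As an illustration of the mean-field ingredient mentioned in the abstract, the following is a minimal sketch of the naive mean-field self-consistency equations for a binary {0,1} restricted Boltzmann machine, which correspond to truncating the Plefka expansion at first order. It is not the authors' implementation: the function name `mean_field_fixed_point`, the parameter names (`W`, `a`, `b`, `m_v0`), the truncation order, and the stopping rule are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def mean_field_fixed_point(W, a, b, m_v0, n_iter=200, tol=1e-8):
    """Iterate naive mean-field equations for a binary {0,1} RBM.

    Hypothetical helper for illustration only.
    W    : (n_visible, n_hidden) coupling matrix
    a, b : visible and hidden biases
    m_v0 : initial visible magnetizations, values in (0, 1)
    Returns the visible and hidden magnetizations after iteration.
    """
    m_v = np.asarray(m_v0, dtype=float).copy()
    m_h = sigmoid(b + m_v @ W)
    for _ in range(n_iter):
        # Hidden magnetizations given the current visible ones.
        m_h = sigmoid(b + m_v @ W)
        # Visible magnetizations given the updated hidden ones.
        m_v_new = sigmoid(a + W @ m_h)
        converged = np.max(np.abs(m_v_new - m_v)) < tol
        m_v = m_v_new
        if converged:
            break
    return m_v, m_h


# Small usage example with random weak couplings (assumed values).
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 3))
a = np.zeros(4)
b = np.zeros(3)
m_v, m_h = mean_field_fixed_point(W, a, b, m_v0=np.full(4, 0.9))
```

The returned magnetizations approximately satisfy both self-consistency equations at once. How such fixed points are followed during training and assembled into a relational tree is specific to the paper and is not reproduced here.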