Glielmo Aldo, Zeni Claudio, Cheng Bingqing, Csányi Gábor, Laio Alessandro
Physics Department, International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy.
Bank of Italy, 00187, Italy.
PNAS Nexus. 2022 Apr 14;1(2):pgac039. doi: 10.1093/pnasnexus/pgac039. eCollection 2022 May.
Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Finding a small set of features that still retains sufficient information about the dataset is important for the successful application of many statistical learning approaches. We introduce a statistical test that assesses the relative information retained when using two different distance measures, and determines whether they are equivalent, independent, or whether one is more informative than the other. This ranking can in turn be used to identify the most informative distance measure and, therefore, the most informative set of features, out of a pool of candidates. To illustrate the general applicability of our approach, we show that it reproduces the known importance ranking of policy variables for Covid-19 control, and also identifies compact yet informative descriptors for atomic structures. We further provide initial evidence that the information asymmetry measured by the proposed test can be used to infer causal relationships between the features of a dataset. The method is general and should be applicable to many branches of science.
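The idea of comparing two distance measures by the information one retains about the other can be illustrated with a small sketch. The abstract does not state the test's formula, so everything below is an assumption made for illustration: the sketch uses a nearest-neighbour rank statistic (for each point, find its nearest neighbour under distance A and ask how far down that neighbour ranks under distance B), the names `neighbor_ranks` and `rank_imbalance` are hypothetical, Euclidean distances are assumed, and the 2/N normalisation is chosen only so that values near 0 suggest "A predicts B" and values near 1 suggest "A says nothing about B".

```python
import numpy as np

def neighbor_ranks(X):
    """ranks[i, j] = rank of point j among the neighbours of point i (1 = nearest).

    Assumption of this sketch: Euclidean distance on the given feature subset.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                      # treat a single feature as one column
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # a point is never its own neighbour
    order = np.argsort(d, axis=1)           # neighbours sorted by distance
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(1, n + 1)[None, :]
    return ranks                            # the point itself ends up with rank n

def rank_imbalance(XA, XB):
    """Hypothetical statistic: 2/N times the mean rank, under distance B,
    of each point's nearest neighbour under distance A."""
    ranks_a = neighbor_ranks(XA)
    ranks_b = neighbor_ranks(XB)
    n = ranks_a.shape[0]
    nearest_in_a = np.argmin(ranks_a, axis=1)   # index of the rank-1 neighbour in A
    return 2.0 / n * ranks_b[np.arange(n), nearest_in_a].mean()

# Toy data: two independent uniform features.
rng = np.random.default_rng(0)
xy = rng.uniform(size=(500, 2))
x_only = xy[:, 0]

full_to_x = rank_imbalance(xy, x_only)   # does (x, y) predict neighbours in x?
x_to_full = rank_imbalance(x_only, xy)   # does x alone predict neighbours in (x, y)?
print(f"(x, y) -> x : {full_to_x:.3f}")
print(f"x -> (x, y) : {x_to_full:.3f}")
```

In this toy setting the pair of values is read jointly: both small would suggest equivalent distances, both close to 1 would suggest independent ones, and a clear asymmetry would suggest that one feature set is more informative than the other. The pairwise distance matrix makes the sketch quadratic in the number of points, so it is only suitable for small datasets.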