Taleb Nassim Nicholas, Zalloua Pierre, Elbassioni Khaled, Hatzikirou Haralampos, Henschel Andreas, Platt Daniel E
Risk Engineering, School of Engineering, New York, USA.
Maroun Semaan Faculty of Engineering and Architecture, American University of Beirut, Beirut, Lebanon.
Comput Struct Biotechnol J. 2024 Dec 11;27:48-56. doi: 10.1016/j.csbj.2024.11.042. eCollection 2025.
Principal Component Analysis (PCA) is a powerful multivariate tool that projects data into low-dimensional representations. Nevertheless, distances between data points in these low-dimensional projections are challenging to interpret. Here, we propose a computationally simple heuristic to transform a map based on standard PCA (when the variables are asymptotically Gaussian) into an entropy-based map in which distances are based on mutual information (MI). Moreover, we show that in certain instances the proposed scaled PCA can improve cluster identification. Rescaling principal component-based distances using MI yields a representation of relative statistical associations when, as in genetics, it is applied to the mutual information between individuals' genomes, measured in bits. This entropy-rescaled PCA preserves order relationships (along a dimension) while expressing relative distances in information units, such as "bits". We illustrate the effect of this rescaling using genomic data derived from world populations and describe how it affects the interpretation of results.
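As a point of reference for the Gaussian case mentioned above, the mutual information between two jointly Gaussian variables with correlation coefficient rho has the standard closed form MI = -(1/2) log2(1 - rho^2) bits. The sketch below only illustrates that textbook identity and a sign-preserving rescaling built on it; the abstract does not spell out the authors' exact procedure, so the function names (`gaussian_mi_bits`, `rescale_signed`) and the choice to treat a coordinate as a correlation-like quantity in (-1, 1) are assumptions made for illustration.

```python
import math


def gaussian_mi_bits(rho: float) -> float:
    """Mutual information (in bits) of a bivariate Gaussian with correlation rho.

    Uses the closed form MI = -0.5 * log2(1 - rho**2), valid for |rho| < 1.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log2(1.0 - rho * rho)


def rescale_signed(rho: float) -> float:
    """Hypothetical helper: map a correlation-like coordinate to signed bits.

    The magnitude is the Gaussian MI and the sign of rho is kept, so the
    ordering of points along the axis is preserved (the map is monotone).
    """
    return math.copysign(gaussian_mi_bits(rho), rho)


if __name__ == "__main__":
    # Equal steps in correlation do not correspond to equal steps in bits:
    # the rescaling stretches distances as |rho| approaches 1.
    for rho in (-0.9, -0.5, 0.0, 0.5, 0.9, 0.99):
        print(f"rho={rho:+.2f}  signed MI={rescale_signed(rho):+.4f} bits")
```

Because the map is monotone in rho, ordering along a dimension is unchanged, while gaps between strongly associated points are stretched relative to gaps near zero.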