Mahon Louis
School of Informatics, Edinburgh University, Edinburgh EH8 9YL, UK.
Entropy (Basel). 2025 Mar 25;27(4):339. doi: 10.3390/e27040339.
Data complexity is an important concept in the natural sciences and related areas, but lacks a rigorous and computable definition. This paper focusses on a particular sense of complexity that is high if the data is structured in a way that could serve to communicate a message. In this sense, human speech, written language, drawings, diagrams and photographs are high complexity, whereas data that is close to uniform throughout or populated by random values is low complexity. I describe a general framework for measuring data complexity based on dividing the shortest description of the data into a structured and an unstructured portion, and taking the size of the former as the complexity score. I outline an application of this framework in statistical mechanics that may allow a more objective characterisation of the macrostate and entropy of a physical system. Then, I derive a more precise and computable definition geared towards human communication, by proposing local compositionality as an appropriate specific structure. Experimental evaluation shows that this method can distinguish meaningful signals from noise or repetitive signals in auditory, visual and text domains, and could potentially help determine whether an extra-terrestrial signal contained a message.