Tomova Georgia D, Walmsley Rosemary, Berrie Laurie, Morris Michelle A, Tennant Peter W G
The Alan Turing Institute, British Library, 96 Euston Road, London, NW1 2DB, UK.
Leeds Institute for Data Analytics, University of Leeds, Leeds, LS2 9NL, UK.
BMC Med Res Methodol. 2025 Apr 17;25(1):100. doi: 10.1186/s12874-025-02509-1.
Compositional data comprise the parts of a 'whole' (or 'total'), which sum to that 'whole'. The 'whole' may vary between units of analysis, or it may be fixed (constant). For example, total energy intake (a variable total) is the sum of intake from all foods or macronutrients. Total time in a day (a fixed total) is the sum of time spent engaging in various activities. There are different approaches to analysing compositional data, such as the isocaloric or isotemporal model, ratio variables, and compositional data analysis (CoDA). Although the performance of the different approaches has been compared previously, these comparisons have only been conducted in real data. Since the true relationships are unknown in real data, it is difficult to compare model performance in estimating a known effect. We use data simulations of different parametric relationships to explore and demonstrate the performance of each approach under various possible conditions.
We simulated physical activity time-use and dietary data as examples of compositional data with fixed and variable totals, respectively, using different parametric relationships between the compositional components and the outcome (fasting plasma glucose): linear, log, and isometric log-ratios. We evaluated the performance of a range of generalised linear and additive models, as well as CoDA, in estimating a 1-unit reallocation and a larger reallocation (10 units for physical activity, 100 units for dietary data) under each parametric scenario. We simulated 10,000 datasets with 1,000 observations in each.
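As a rough illustration of two ingredients named above, the sketch below computes isometric log-ratio coordinates for a composition and applies a fixed-size reallocation between two parts. It is not the authors' simulation code: the pivot-coordinate basis, the function names `ilr` and `reallocate`, and the three-part example day are all assumptions made for illustration.

```python
import numpy as np

def ilr(parts):
    """Isometric log-ratio (pivot) coordinates of a D-part composition.

    Assumed basis: coordinate i contrasts part i with the geometric
    mean of all later parts. Other orthonormal bases are equally valid.
    """
    parts = np.asarray(parts, dtype=float)
    d = parts.shape[-1]
    logs = np.log(parts)
    coords = []
    for i in range(d - 1):
        remaining = d - i - 1  # number of parts after part i
        scale = np.sqrt(remaining / (remaining + 1))
        coords.append(scale * (logs[..., i] - logs[..., i + 1:].mean(axis=-1)))
    return np.stack(coords, axis=-1)

def reallocate(parts, src, dst, amount):
    """Move `amount` units from part `src` to part `dst`, keeping the total."""
    out = np.array(parts, dtype=float)
    out[..., src] -= amount
    out[..., dst] += amount
    return out

# Hypothetical day in minutes: sleep, sedentary time, physical activity.
day = np.array([480.0, 600.0, 360.0])
shifted = reallocate(day, src=1, dst=2, amount=10.0)  # a 10-minute reallocation

z_before = ilr(day)
z_after = ilr(shifted)
```

The reallocation is a constant shift on the original scale but a non-constant shift in the log-ratio coordinates, which is one way to see why a model parameterised on one scale can misestimate an effect generated on the other.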
The performance of each approach to analysing compositional data depends on how closely its parameterisation matches the true data generating process. Overall, we demonstrated that the consequences of using an incorrect parameterisation (e.g. using CoDA when the true relationship is linear) are more severe for larger reallocations (e.g. 10-min or 100-kcal) than for 1-unit reallocations. The implications of choosing an unsuitable approach may be starker in compositional data with variable totals. For example, while models with ratio variables are mathematically equivalent to linear models in compositional data with fixed totals, their estimates may be radically different for variable totals.
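The point about ratio variables can be checked numerically. The sketch below is a minimal illustration on made-up data, not a reproduction of the study: the data-generating values, the choice to omit the third component, and the helper `fitted_values` are assumptions. It fits ordinary least squares on absolute amounts and on proportions of the total, first with a fixed total and then with a variable one.

```python
import numpy as np

def fitted_values(X, y):
    """Ordinary least squares fitted values, with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

rng = np.random.default_rng(0)
n = 500
shares = rng.dirichlet([5, 3, 2], size=n)  # hypothetical three-part composition

# Fixed total: every unit has 1440 minutes to allocate.
fixed = shares * 1440.0
y_fixed = 5.0 - 0.002 * fixed[:, 0] + rng.normal(0, 0.1, n)
lin_fixed = fitted_values(fixed[:, :2], y_fixed)
ratio_fixed = fitted_values(fixed[:, :2] / fixed.sum(axis=1, keepdims=True), y_fixed)
fixed_gap = np.max(np.abs(lin_fixed - ratio_fixed))

# Variable total: total intake differs between units.
totals = rng.normal(2000, 300, n)
variable = shares * totals[:, None]
y_var = 5.0 + 0.001 * variable[:, 0] + rng.normal(0, 0.1, n)
lin_var = fitted_values(variable[:, :2], y_var)
ratio_var = fitted_values(variable[:, :2] / totals[:, None], y_var)
variable_gap = np.max(np.abs(lin_var - ratio_var))

print(f"max fitted-value gap, fixed total:    {fixed_gap:.2e}")
print(f"max fitted-value gap, variable total: {variable_gap:.2e}")
```

With a fixed total, dividing by the total only rescales each predictor by a constant, so the two models span the same space and give the same fitted values up to floating-point error. With a variable total the divisor differs between units, and the two models are no longer reparameterisations of each other.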
Compositional data with fixed and variable totals behave differently. All existing approaches to analysing such data have utility but need to be carefully selected. Investigators should explore the shape of the relationships between the compositional components and the outcome and choose the approach that matches it best.