Centre for Data and Knowledge Integration for Health, Gonçalo Moniz Institute, Oswaldo Cruz Foundation, Salvador, BA, Brazil.
ISGlobal, Hospital Clínic. Universitat de Barcelona, Barcelona, Spain.
BMC Med Res Methodol. 2024 Feb 15;24(1):38. doi: 10.1186/s12874-024-02161-1.
Several strategies for identifying biologically implausible values in longitudinal anthropometric data have recently been proposed, but their suitability for large population datasets needs to be better understood. This study evaluated the impact of removing population outliers, and the additional value of identifying and removing longitudinal outliers, on the trajectories of length/height and weight and on the prevalence of child growth indicators in a large longitudinal child growth dataset.
Length/height and weight measurements of children aged 0 to 59 months from the Brazilian Food and Nutrition Surveillance System were analyzed. Population outliers were identified using z-scores from the World Health Organization (WHO) growth charts. After population outliers were identified and removed, residuals from linear mixed-effects models were used to flag longitudinal outliers. Four residual cutoffs were tested for flagging longitudinal outliers: -3/+3, -4/+4, -5/+5, and -6/+6. The selected child growth indicators were length/height-for-age z-scores and weight-for-age z-scores, classified according to the WHO charts.
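The two-step flagging described above can be illustrated with a minimal Python sketch. It is not the authors' code. The abstract does not state the z-score limits used for population outliers, so the limits below (the commonly cited WHO flag limits for height-for-age and weight-for-age) are an assumption, as are all function and variable names. The sketch also assumes that standardized residuals from an already fitted linear mixed-effects model are available; the model fitting itself is not shown.

```python
# Assumed z-score limits for population outliers; the abstract does not state them.
POPULATION_LIMITS = {"haz": (-6.0, 6.0), "waz": (-6.0, 5.0)}

# Symmetric residual cutoffs tested in the study: -3/+3, -4/+4, -5/+5, -6/+6.
RESIDUAL_CUTOFFS = (3, 4, 5, 6)


def is_population_outlier(z, indicator):
    """Return True when a z-score lies outside the assumed limits for the indicator."""
    low, high = POPULATION_LIMITS[indicator]
    return z < low or z > high


def flag_longitudinal(std_residuals, cutoff):
    """Flag standardized residuals strictly below -cutoff or above +cutoff."""
    return [abs(r) > cutoff for r in std_residuals]


def percent_flagged(std_residuals, cutoffs=RESIDUAL_CUTOFFS):
    """Return the percentage of residuals flagged at each cutoff."""
    n = len(std_residuals)
    return {
        c: 100.0 * sum(flag_longitudinal(std_residuals, c)) / n
        for c in cutoffs
    }


if __name__ == "__main__":
    # Made-up residuals for illustration only; not data from the study.
    toy_residuals = [0.1, -3.2, 4.5, -6.1]
    print(percent_flagged(toy_residuals))
```

Because a wider cutoff can only flag a subset of what a narrower one flags, the flagged percentage falls as the cutoff moves from -3/+3 to -6/+6, which matches the direction of the ranges reported in the results.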
The dataset included 50,154,738 records from 10,775,496 children. Boys and girls had 5.74% and 5.31% of length/height values and 5.19% and 4.74% of weight values flagged as population outliers, respectively. After these were removed, the percentage of longitudinal outliers in boys ranged from 0.02% (<-6/>+6) to 1.47% (<-3/>+3) for length/height and from 0.07% to 1.44% for weight. In girls, it ranged from 0.01% to 1.50% for length/height and from 0.08% to 1.45% for weight. The initial removal of population outliers had the most substantial effect on the growth trajectories, as it was the first step in the cleaning process, while the additional removal of longitudinal outliers had less influence on them, regardless of the cutoff adopted. The prevalence of the selected indicators was also affected by population outliers and, to a lesser extent, by longitudinal outliers.
Although flagging either population or longitudinal outliers can detect biologically implausible values in child growth data, removing population outliers seemed more relevant in this large administrative dataset, especially for calculating summary statistics. However, both types of outliers need to be identified and removed for the proper evaluation of trajectories.