A comparison of several methods for analyzing censored data.

Author information

Hewett Paul, Ganser Gary H

Affiliation

Exposure Assessment Solutions, Inc., Morgantown, West Virginia, USA.

Publication information

Ann Occup Hyg. 2007 Oct;51(7):611-32. doi: 10.1093/annhyg/mem045.

Abstract

The purpose of this study was to compare the performance of several methods for statistically analyzing censored datasets [i.e. datasets that contain measurements that are less than the field limit-of-detection (LOD)] when estimating the 95th percentile and the mean of right-skewed occupational exposure data. The methods examined were several variations on the maximum likelihood estimation (MLE) and log-probit regression (LPR) methods, the common substitution methods, several non-parametric (NP) quantile methods for the 95th percentile and the NP Kaplan-Meier (KM) method. Each method was challenged with computer-generated censored datasets for a variety of plausible scenarios where the following factors were allowed to vary randomly within fairly wide ranges: the true geometric standard deviation, the censoring point or LOD and the sample size. This was repeated for both a single-laboratory scenario (i.e. single LOD) and a multiple-laboratory scenario (i.e. three LODs) as well as a single lognormal distribution scenario and a contaminated lognormal distribution scenario. Each method was used to estimate the 95th percentile and mean for the censored datasets (the NP quantile methods estimated only the 95th percentile). For each scenario, the method bias and overall imprecision (as indicated by the root mean square error or rMSE) were calculated for the 95th percentile and mean. No single method was unequivocally superior across all scenarios, although nearly all of the methods excelled in one or more scenarios. Overall, only the MLE- and LPR-based methods performed well across all scenarios, with the robust versions generally showing less bias than the standard versions when challenged with a contaminated lognormal distribution and multiple LODs. All of the MLE- and LPR-based methods were remarkably robust to departures from the lognormal assumption, nearly always having lower rMSE values than the NP methods for the exposure scenarios postulated. 
In general, the MLE methods tended to have smaller rMSE values than the LPR methods, particularly for the small sample size scenarios. The substitution methods tended to be strongly biased, but in some scenarios had the smaller rMSE values, especially for sample sizes <20. Surprisingly, the various NP methods were not as robust as expected, performing poorly in the contaminated distribution scenarios for both the 95th percentile and the mean. In conclusion, when using the rMSE rather than bias as the preferred comparison metric, the standard MLE method consistently outperformed the so-called robust variations of the MLE-based and LPR-based methods, as well as the various NP methods, for both the 95th percentile and the mean. When estimating the mean, the standard LPR method tended to outperform the robust LPR-based methods. Whenever bias is the main consideration, the robust MLE-based methods should be considered. The KM method, currently hailed by some as the preferred method for estimating the mean when the lognormal distribution assumption is questioned, did not perform well for either the 95th percentile or mean and is not recommended.
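
To make the two main estimation targets concrete, the following is a minimal sketch of how a lognormal MLE for left-censored data and a simple substitution estimate might be computed. It is not the authors' implementation. The function names are hypothetical, the EM iteration is an assumption of this sketch (chosen because it needs only the Python standard library), and LOD/2 is assumed here as one example of a common substitution rule. Passing one LOD per non-detect lets the same code handle the single-LOD and multiple-LOD cases.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: pdf, cdf, inv_cdf


def censored_lognormal_mle(detects, nondetect_lods, tol=1e-9, max_iter=10_000):
    """Estimate (mu, sigma) of log-exposure from left-censored lognormal data.

    Hypothetical helper: uses EM iteration (an assumption of this sketch),
    which converges to the censored-data maximum likelihood estimate.
    """
    y = [math.log(x) for x in detects]           # detected values, log scale
    c = [math.log(lod) for lod in nondetect_lods]  # one log-LOD per non-detect
    if len(y) < 2:
        raise ValueError("need at least two detected values")
    n = len(y) + len(c)
    # Starting values from the detected measurements only.
    mu = sum(y) / len(y)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in y) / len(y)) or 1.0
    for _ in range(max_iter):
        # E-step: mean and variance of a normal truncated above at each log-LOD.
        cond_mean, cond_var = [], []
        for cj in c:
            a = (cj - mu) / sigma
            lam = STD_NORMAL.pdf(a) / max(STD_NORMAL.cdf(a), 1e-300)
            cond_mean.append(mu - sigma * lam)
            cond_var.append(sigma ** 2 * max(1.0 - a * lam - lam ** 2, 0.0))
        # M-step: update mu and sigma from the completed-data moments.
        mu_new = (sum(y) + sum(cond_mean)) / n
        ss = sum((v - mu_new) ** 2 for v in y)
        ss += sum(s + (m - mu_new) ** 2 for m, s in zip(cond_mean, cond_var))
        sigma_new = math.sqrt(ss / n)
        converged = abs(mu_new - mu) + abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma


def lognormal_p95_and_mean(mu, sigma):
    """Plug-in estimates of the 95th percentile and arithmetic mean."""
    z95 = STD_NORMAL.inv_cdf(0.95)
    return math.exp(mu + z95 * sigma), math.exp(mu + sigma ** 2 / 2.0)


def substitution_mean(detects, nondetect_lods, factor=0.5):
    """Arithmetic mean after replacing each non-detect with factor * LOD."""
    values = list(detects) + [factor * lod for lod in nondetect_lods]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Illustrative simulated data only; these parameters are not from the paper.
    rng = random.Random(1)
    true_mu, true_sigma, lod = 0.0, math.log(2.5), 0.5
    sample = [rng.lognormvariate(true_mu, true_sigma) for _ in range(30)]
    detects = [x for x in sample if x >= lod]
    nondetects = [lod] * (len(sample) - len(detects))
    mu_hat, sigma_hat = censored_lognormal_mle(detects, nondetects)
    p95_hat, mean_hat = lognormal_p95_and_mean(mu_hat, sigma_hat)
    print("MLE 95th percentile:", p95_hat)
    print("MLE mean:", mean_hat)
    print("LOD/2 substitution mean:", substitution_mean(detects, nondetects))
```

Repeating such a simulation many times and comparing each estimate with the true value is the general idea behind the bias and rMSE comparisons described in the abstract. The study's actual scenarios, parameter ranges and robust variants are not reproduced here.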
