Zink Richard C, Castro-Schilo Laura, Ding Jianfeng
TARGET PharmaSolutions; Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
JMP Division, SAS Institute, Cary, NC, USA.
Pharm Stat. 2018 Nov;17(6):846-853. doi: 10.1002/pst.1903. Epub 2018 Sep 26.
Mahalanobis distance is often recommended for identifying unusual patients or clinical sites in clinical trials. Patients who are extreme in one or more covariates may be considered outliers because they lie far from the multivariate mean, which can be thought of as the center of the data cloud. Less often discussed are inliers: patients whose data are believed to be "too good to be true" and who are located near the centroid. To investigate these anomalies efficiently for potential lapses in data quality, it is important to understand how the individual variables contribute to each multivariate outlier. The literature lacks a reasonable workflow for identifying outliers and then investigating how each variable contributes to an observation being considered extreme. We describe how to identify multivariate inliers and outliers, classify outliers by level of severity, and summarize the contributions of variables using principal components in a manner that is straightforward to interpret and accessible to a wide audience. We show how data visualizations, including Pareto plots, can facilitate further review even in studies containing many observations and variables. We illustrate these methodologies using data from a multicenter clinical trial.
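As a rough illustration of the ideas in the abstract, the sketch below computes squared Mahalanobis distances from the sample centroid, splits each distance into one term per principal component, and labels observations as inliers or outliers. It is a minimal sketch under stated assumptions, not the authors' workflow: the function names, the simulated data, and the cutoffs are all hypothetical. The cutoffs are approximately the 0.001 and 0.999 quantiles of a chi-square distribution with 4 degrees of freedom, a common reference when the data are roughly multivariate normal.

```python
import numpy as np


def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row from the sample centroid."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Solve cov @ y = centered.T instead of inverting cov explicitly.
    solved = np.linalg.solve(cov, centered.T).T
    return np.einsum("ij,ij->i", centered, solved)


def pc_contributions(X):
    """Split each squared distance into one non-negative term per principal component.

    Returns (contrib, loadings). Each row of contrib sums to that row's squared
    Mahalanobis distance; the columns of loadings are the eigenvectors that show
    how the original variables enter each component.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, loadings = np.linalg.eigh(cov)  # assumes cov is full rank
    scores = centered @ loadings
    return scores**2 / eigvals, loadings


def flag(d2, lower, upper):
    """Label each observation by comparing its squared distance to two cutoffs."""
    labels = np.full(d2.shape, "typical", dtype=object)
    labels[d2 < lower] = "inlier"
    labels[d2 > upper] = "outlier"
    return labels


# Simulated data (hypothetical): 200 patients, 4 covariates, one planted extreme row.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[0] = [6.0, 0.0, 0.0, 0.0]

d2 = mahalanobis_sq(X)
contrib, loadings = pc_contributions(X)

# Illustrative cutoffs only (assumption): approximate chi-square(4) quantiles.
labels = flag(d2, lower=0.0908, upper=18.47)

# Pareto-style ordering for the planted row: largest component contribution first.
order = np.argsort(contrib[0])[::-1]
share = contrib[0, order] / d2[0]
```

Plotting `share` as a sorted bar chart with a cumulative line would give a Pareto-style display for a single observation. To trace a dominant component back to the original variables, inspect the corresponding column of `loadings`.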