Practical approaches to principal component analysis for simultaneously dealing with missing and censored elements in chemical data.

Author information

Department of Theoretical Chemistry, Institute of Chemistry, The University of Silesia, 9 Szkolna Street, 40-006 Katowice, Poland.

Publication information

Anal Chim Acta. 2013 Sep 24;796:27-37. doi: 10.1016/j.aca.2013.08.026. Epub 2013 Aug 20.

DOI:10.1016/j.aca.2013.08.026

PMID:24016579

Abstract

Multivariate chemical data often contain elements that are missing completely at random as well as so-called left-censored elements, whose values are known only to lie below a definite threshold (the reporting limit). In recent years, attention has been paid to developing methods for data containing missing elements and methods that can handle data with both missing elements and outliers. However, processing data with both missing and left-censored elements remains an open problem. The aim of this work was to investigate which method is most suitable for handling left-censored and missing-completely-at-random elements that are present simultaneously in chemical data. To this end, the recently proposed generalized nonlinear iterative partial least squares (NIPALS) algorithm was compared with methods that incorporate uncertainty information, such as maximum likelihood principal component analysis (MLPCA), and with replacement methods. The results of a Monte Carlo simulation study on artificial and real data sets showed that substitution with half of the reporting limit can be used when the percentage of left-censored elements per variable is up to 30-40%. The generalized NIPALS algorithm is generally recommended when the percentage of left-censored elements per variable is large, particularly when a large number of variables are censored. The expectation-maximization approach applied to data in which censored elements have been substituted with half of the reporting limits can be a strategy for dealing with missing and left-censored elements, but if the convergence criterion is not fulfilled, the generalized NIPALS algorithm can be applied instead.
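
The substitution-plus-expectation-maximization strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it is neither the generalized NIPALS algorithm nor MLPCA: it assumes a common iterative-SVD form of EM for PCA with missing values, and the function names (`substitute_half_limit`, `em_pca_impute`), the tolerance, and the relative-change stopping rule are hypothetical choices made for illustration only.

```python
import numpy as np

def substitute_half_limit(X, censored, limits):
    """Replace left-censored entries with half of the per-variable reporting limit."""
    X = np.array(X, dtype=float)
    half = np.broadcast_to(np.asarray(limits, dtype=float) / 2.0, X.shape)
    X[censored] = half[censored]
    return X

def em_pca_impute(X, n_components, tol=1e-8, max_iter=500):
    """Fill NaN entries by alternating a PCA fit with re-prediction of the gaps.

    Assumed stopping rule: relative squared change of the imputed values.
    Returns the completed matrix and a flag telling whether the rule was met.
    """
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.broadcast_to(col_means, X.shape)[missing]
    converged = False
    for _ in range(max_iter):
        mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        # Rank-limited reconstruction from the leading components.
        X_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean
        change = np.sum((X[missing] - X_hat[missing]) ** 2)
        X[missing] = X_hat[missing]  # only the missing cells are updated
        if change <= tol * max(np.sum(X[missing] ** 2), 1.0):
            converged = True
            break
    return X, converged

# Usage on a small made-up matrix (values are illustrative, not from the paper).
data = np.array([[1.2, 0.0, 3.1],
                 [0.9, 2.2, np.nan],
                 [1.5, 2.9, 3.8],
                 [np.nan, 2.5, 3.3]])
censored_mask = np.zeros(data.shape, dtype=bool)
censored_mask[0, 1] = True               # this value was reported as "< limit"
reporting_limits = [0.5, 1.0, 0.5]       # one limit per variable (column)
filled, ok = em_pca_impute(
    substitute_half_limit(data, censored_mask, reporting_limits), n_components=1)
```

If the returned flag is false, the abstract's recommendation is to fall back on the generalized NIPALS algorithm, which is not sketched here.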
