Physical Sciences of Imaging in the Biomedical Sciences Doctoral Training Centre, School of Chemistry, University of Birmingham, Edgbaston, Birmingham, United Kingdom.
Anal Chem. 2013 Mar 19;85(6):3071-8. doi: 10.1021/ac302528v. Epub 2013 Mar 6.
A memory-efficient algorithm for the computation of principal component analysis (PCA) of large mass spectrometry imaging data sets is presented. Mass spectrometry imaging (MSI) enables two- and three-dimensional overviews of hundreds of unlabeled molecular species in complex samples such as intact tissue. PCA, in combination with data binning or other reduction algorithms, has been widely used in the unsupervised processing of MSI data and as a dimensionality reduction method prior to clustering and spatial segmentation. Standard implementations of PCA require the data to be stored in random access memory. This imposes an upper limit on the amount of data that can be processed, necessitating a compromise between the number of pixels and the number of peaks to include. With increasing interest in multivariate analysis of large 3D multislice data sets and ongoing improvements in instrumentation, the ability to retain all pixels and many more peaks is increasingly important. We present a new method that has no limitation on the number of pixels and allows an increased number of peaks to be retained. The new technique was validated against the MATLAB (The MathWorks Inc., Natick, Massachusetts) implementation of PCA (princomp) and then used to reduce, without discarding peaks or pixels, multiple serial sections acquired from a single mouse brain, a data set too large to be analyzed with princomp. k-means clustering was then performed on the reduced data set. We further demonstrate with simulated data of 83 slices, comprising 20,535 pixels per slice and equaling 44 GB of data, that the new method can be used in combination with existing tools to process an entire organ. MATLAB code implementing the memory-efficient PCA algorithm is provided.
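The abstract does not describe how the algorithm works, so the following is only a minimal sketch of one common way to make PCA memory use independent of the number of pixels. It is not the authors' MATLAB code. It assumes the data can be read in chunks of pixels (rows) by peaks (columns) and accumulates the running sum and the sum of outer products, so that only a peaks-by-peaks matrix is held in memory. The function names `chunked_pca` and `project` are hypothetical.

```python
import numpy as np


def chunked_pca(chunks, n_components):
    """PCA from an iterable of (pixels x peaks) chunks.

    Assumption for illustration: memory use grows with peaks**2,
    not with the total number of pixels.
    """
    n = 0
    total = None    # running sum of each peak, shape (peaks,)
    scatter = None  # running sum of X^T X, shape (peaks, peaks)
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        if total is None:
            p = chunk.shape[1]
            total = np.zeros(p)
            scatter = np.zeros((p, p))
        n += chunk.shape[0]
        total += chunk.sum(axis=0)
        scatter += chunk.T @ chunk

    mean = total / n
    # Sample covariance from the accumulated sums (one-pass formula).
    cov = (scatter - n * np.outer(mean, mean)) / (n - 1)

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], eigvals[order]


def project(chunk, mean, components):
    """Scores for one chunk of pixels; can be applied chunk by chunk."""
    return (np.asarray(chunk, dtype=np.float64) - mean) @ components


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 8))       # 1000 pixels, 8 peaks
    pieces = np.array_split(data, 10)       # stand-in for reading from disk
    mean, components, variances = chunked_pca(pieces, n_components=3)
    scores = np.vstack([project(c, mean, components) for c in pieces])
    print(scores.shape)
```

Because the scores are also computed one chunk at a time, the reduced data can be passed on to a clustering step such as k-means without loading the full data set. The one-pass covariance formula used here can lose precision when peak means are large relative to their variances; a two-pass variant that subtracts the mean first would be one way to address that.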