Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA.
BMC Bioinformatics. 2010 Nov 4;11:546. doi: 10.1186/1471-2105-11-546.
In a high throughput setting, effective flow cytometry data analysis depends heavily on proper data preprocessing. While the usual preprocessing steps of quality assessment, outlier removal, normalization, and gating have received considerable scrutiny from the community, the influence of data transformation on the output of high throughput analysis has been largely overlooked. Flow cytometry measurements can vary over several orders of magnitude, and cell populations can have variances that depend on their mean fluorescence intensities and may exhibit heavily skewed distributions. Consequently, the choice of data transformation can influence the output of automated gating. An appropriate data transformation aids in data visualization and gating of cell populations across the range of data. Experience shows that the choice of transformation is data specific. Our goal here is to compare the performance of different transformations applied to flow cytometry data in the context of automated gating in a high throughput, fully automated setting. We examine the most common transformations used in flow cytometry, including the generalized hyperbolic arcsine, biexponential, linlog, and generalized Box-Cox, all within the BioConductor flowCore framework that is widely used in high throughput, automated flow cytometry data analysis. All of these transformations have adjustable parameters whose effects upon the data are non-intuitive for most users. By making some modelling assumptions about the transformed data, we develop maximum likelihood criteria to optimize parameter choice for these different transformations.
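To make the idea of a maximum likelihood criterion for a transformation parameter concrete, the following is a minimal Python sketch, not the flowTrans implementation. It assumes a single channel, a one-parameter hyperbolic arcsine of the form y = asinh(b·x), and a univariate normal model for the transformed values; the function names, the parameterization, and the grid search are all illustrative assumptions. The criterion is the normal profile log-likelihood of the transformed data plus the log-Jacobian of the transformation, which keeps likelihoods for different values of b comparable on the original data scale.

```python
import math
import random


def profile_loglik(x, b):
    """Profile log-likelihood of b for y = asinh(b * x), assuming y is normal.

    The mean and variance of y are replaced by their maximum likelihood
    estimates, and additive constants that do not depend on b are dropped.
    Assumes the values in x are not all identical.
    """
    n = len(x)
    y = [math.asinh(b * v) for v in x]
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    # log |dy/dx| = log(b) - 0.5 * log(1 + (b * x)^2), summed over events
    log_jacobian = sum(math.log(b) - 0.5 * math.log1p((b * v) ** 2) for v in x)
    return -0.5 * n * math.log(var) + log_jacobian


def fit_arcsinh_b(x, log10_lo=-4.0, log10_hi=2.0, steps=121):
    """Pick b from a log-spaced grid by maximizing the profile log-likelihood.

    The grid bounds are arbitrary choices for this illustration.
    """
    width = log10_hi - log10_lo
    grid = [10 ** (log10_lo + i * width / (steps - 1)) for i in range(steps)]
    return max(grid, key=lambda b: profile_loglik(x, b))


if __name__ == "__main__":
    # Simulated data (hypothetical): exactly normal after asinh(0.1 * x)
    random.seed(0)
    b_true = 0.1
    x = [math.sinh(random.gauss(0.0, 2.0)) / b_true for _ in range(2000)]
    print("estimated b:", fit_arcsinh_b(x))
```

A grid search is used here only because it is easy to verify by eye; a one-dimensional numerical optimizer would serve equally well. Extending the sketch to several channels would require a multivariate model for the transformed data.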
We compare the performance of parameter-optimized and default-parameter (in flowCore) data transformations on real and simulated data by measuring the variation in the locations of cell populations across samples, discovered via automated gating in both the scatter and fluorescence channels. We find that parameter-optimized transformations improve visualization, reduce variability in the location of discovered cell populations across samples, and decrease the misclassification (mis-gating) of individual events when compared to default-parameter counterparts.
Our results indicate that the preferred transformation for fluorescence channels is a parameter-optimized biexponential or generalized Box-Cox, in accordance with current best practices. Interestingly, for populations in the scatter channels, we find that the optimized hyperbolic arcsine may be a better choice in a high-throughput setting than the current standard practice of no transformation. However, generally speaking, the choice of transformation remains data-dependent. We have implemented our algorithm in the BioConductor package flowTrans, which is publicly available.