Seiler Christof, Ferreira Anne-Maud, Kronstad Lisa M, Simpson Laura J, Le Gars Mathieu, Vendrame Elena, Blish Catherine A, Holmes Susan
Department of Data Science and Knowledge Engineering, Maastricht University, Maastricht, The Netherlands.
Mathematics Centre Maastricht, Maastricht University, Maastricht, The Netherlands.
BMC Bioinformatics. 2021 Mar 22;22(1):137. doi: 10.1186/s12859-021-04067-x.
Flow and mass cytometry are important modern immunology tools for measuring expression levels of multiple proteins on single cells. The goal is to better understand response mechanisms at the single-cell level by studying differential expression of proteins. Most current data analysis tools compare expression across many computationally discovered cell types. We instead focus on just one cell type. This narrower field of application allows us to define a more specific statistical model whose statistical guarantees are easier to control.
Differential analysis of marker expression can be difficult due to marker correlations and inter-subject heterogeneity, particularly in studies of human immunology. We address these challenges with two multiple regression strategies: a bootstrapped generalized linear model and a generalized linear mixed model. On simulated datasets, we compare the robustness of both strategies to marker correlations and heterogeneity. For paired experiments, we find that both strategies maintain the target false discovery rate under medium correlations and that mixed models are statistically more powerful when the model is correctly specified. For unpaired experiments, our results indicate that much larger patient sample sizes are required to detect differences. We illustrate the CytoGLMM R package and workflow for both strategies on a pregnancy dataset.
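To make the first strategy concrete, the following is a minimal Python sketch of a donor-level bootstrap around a multiple logistic regression. It is not the CytoGLMM API (CytoGLMM is an R package). The function names (`fit_logistic`, `simulate_paired`, `donor_bootstrap`), the simulated data, and all parameter values are hypothetical, and the choice of the experimental condition as the response with marker expressions as predictors is an assumption made for illustration. Whole donors are resampled, rather than individual cells, so that cells from the same donor stay together.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Fit logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)           # guard against overflow in exp
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p) - ridge * beta        # tiny ridge term keeps the solve stable
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def simulate_paired(n_donors=12, n_cells=100, n_markers=3, shift=1.0, seed=0):
    """Toy paired design: every donor contributes cells under both conditions."""
    rng = np.random.default_rng(seed)
    rows, labels, donors = [], [], []
    for d in range(n_donors):
        donor_effect = rng.normal(0.0, 0.5, size=n_markers)  # donor heterogeneity
        for cond in (0, 1):
            expr = rng.normal(0.0, 1.0, size=(n_cells, n_markers)) + donor_effect
            expr[:, 0] += shift * cond             # only marker 0 differs between conditions
            rows.append(expr)
            labels.append(np.full(n_cells, cond))
            donors.append(np.full(n_cells, d))
    return np.vstack(rows), np.concatenate(labels), np.concatenate(donors)

def donor_bootstrap(X, y, donor, n_boot=200, seed=1):
    """Resample donors with replacement and refit the regression each time."""
    rng = np.random.default_rng(seed)
    ids = np.unique(donor)
    design = np.column_stack([np.ones(len(y)), X])  # column 0 is the intercept
    cells_by_donor = {d: np.flatnonzero(donor == d) for d in ids}
    coefs = np.empty((n_boot, design.shape[1]))
    for b in range(n_boot):
        sampled = rng.choice(ids, size=len(ids), replace=True)
        idx = np.concatenate([cells_by_donor[d] for d in sampled])
        coefs[b] = fit_logistic(design[idx], y[idx])
    return coefs

X, y, donor = simulate_paired()
boot = donor_bootstrap(X, y, donor)
# Percentile intervals; rows are (lower, upper), columns are (intercept, markers...)
ci = np.percentile(boot, [2.5, 97.5], axis=0)
for j in range(X.shape[1]):
    print(f"marker {j}: 95% interval [{ci[0, j + 1]:.2f}, {ci[1, j + 1]:.2f}]")
```

Because all markers enter the regression together, each coefficient describes a marker's association with the condition given the other markers, which is how a multiple regression can account for marker correlations. The second strategy, a mixed model, would replace the resampling with a donor-level random effect; it is not shown here.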
Our approach to finding differential proteins in flow and mass cytometry data reduces biases arising from marker correlations and safeguards against false discoveries induced by patient heterogeneity.