Eliot M, Azzoni L, Firnhaber C, Stevens W, Glencross D K, Sanne I, Montaner L J, Foulkes A S
Division of Biostatistics, University of Massachusetts, Amherst, MA 01003, USA.
Adv Bioinformatics. 2009;2009:235320. doi: 10.1155/2009/235320. Epub 2010 Jan 21.
We demonstrate the application and comparative interpretation of three tree-based algorithms for the analysis of flow cytometry data: classification and regression trees (CART), random forests (RFs), and logic regression (LR). Specifically, we ask what best predicts CD4 T-cell recovery in HIV-1-infected persons starting antiretroviral therapy with a CD4 count between 200 and 350 cells/µL. A comparison to a more standard contingency table analysis is provided. While contingency table analysis and RFs provide information on the importance of each potential predictor variable, CART and LR offer additional insight into the combinations of variables that together are predictive of the outcome. In all cases considered, baseline CD3-DR-CD56+CD16+ emerges as an important predictor variable, while the tree-based approaches identify additional variables as potentially informative. Application of tree-based methods to our data suggests that a combination of baseline immune activation states, with emphasis on CD8 T-cell activation, may be a better predictor than any single T-cell/innate cell subset analyzed. Taken together, these results show that tree-based methods can be successfully applied to flow cytometry data to discover associations that may not emerge in a univariate analysis.
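As a rough illustration of the kind of splitting rule that underlies CART, the sketch below scores dichotomized predictors by their decrease in Gini impurity and picks the best single split. It is a minimal, generic example and not the authors' analysis: the names `marker_a` and `marker_b`, the toy data, and the helper functions `gini` and `best_split` are all hypothetical, and the use of Gini impurity as the splitting criterion is an assumption, since the abstract does not state which criterion was used.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_name, impurity_decrease) for the best binary split.

    rows   : list of dicts mapping a feature name to 0 or 1
             (e.g. a marker dichotomized as low/high; hypothetical data).
    labels : list of class labels, one per row.
    """
    parent = gini(labels)
    n = len(labels)
    best = (None, 0.0)
    for feat in rows[0]:
        left = [y for r, y in zip(rows, labels) if r[feat] == 0]
        right = [y for r, y in zip(rows, labels) if r[feat] == 1]
        # Weighted impurity of the two child nodes after splitting on feat.
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best[1]:
            best = (feat, gain)
    return best


# Toy data (entirely made up): marker_a separates the classes perfectly,
# marker_b only weakly.
rows = [
    {"marker_a": 1, "marker_b": 0},
    {"marker_a": 1, "marker_b": 1},
    {"marker_a": 1, "marker_b": 0},
    {"marker_a": 0, "marker_b": 1},
    {"marker_a": 0, "marker_b": 0},
    {"marker_a": 0, "marker_b": 1},
]
labels = [1, 1, 1, 0, 0, 0]

feature, gain = best_split(rows, labels)
print(feature, round(gain, 3))
```

A full tree applies this step recursively within each child node, which is how a combination of variables, rather than a single one, can end up defining a predictive subgroup.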