Chen Kun, Mishra Neha, Smyth Joan, Bar Haim, Schifano Elizabeth, Kuo Lynn, Chen Ming-Hui
Department of Statistics, University of Connecticut.
Department of Pathobiology and Veterinary Science, University of Connecticut.
J Am Stat Assoc. 2018;113(522):546-559. doi: 10.1080/01621459.2017.1356314. Epub 2018 Jun 12.
Necrotic enteritis (NE) is a serious disease of poultry caused by the bacterium Clostridium perfringens. To identify proteins of C. perfringens that confer virulence with respect to NE, the protein secretions of four NE disease-producing strains and one baseline non-disease-producing strain of C. perfringens were examined. The problem then becomes a clustering task: identifying two extreme groups of proteins that were produced at either concordantly higher or concordantly lower levels across all four disease-producing strains compared to the baseline, when most of the proteins do not exhibit significant change across all strains. However, the existence of some nuisance proteins of discordant change may severely distort any biologically meaningful cluster pattern. We develop a tailored multivariate clustering approach to robustly identify the proteins of concordant change. Using a three-component normal mixture model as the skeleton, our approach incorporates several constraints to account for biological expectations and data characteristics. More importantly, we adopt a sparse mean-shift parameterization in the reference distribution, coupled with a regularized estimation approach, to flexibly accommodate proteins of discordant change. We explore the connections and differences between our approach and other robust clustering methods, and resolve the issue of unbounded likelihood under an eigenvalue-ratio condition. Simulation studies demonstrate the superior performance of our method compared with a number of alternative approaches. Our protein analysis, along with further biological investigations, may shed light on the discovery of the complete set of virulence factors in NE.
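As a rough illustration of the kind of model the abstract describes, the display below sketches a three-component normal mixture with a sparse mean shift in the reference component. All notation here is assumed for illustration and is not taken from the paper: y_i is the vector of changes for protein i in the four disease-producing strains relative to the baseline, the index 0 marks the reference (no-change) component, the indices - and + mark the concordantly lower and higher components, gamma_i is a protein-specific shift, lambda is a tuning parameter, and c is a bound on the eigenvalue ratio. The group-lasso style penalty is one plausible choice of regularizer, not necessarily the one the authors use.

```latex
% Hypothetical notation; a sketch under stated assumptions, not the authors' exact formulation.
\begin{align*}
  f(y_i) &= \pi_{-}\,\phi(y_i;\ \mu_{-},\ \Sigma_{-})
          + \pi_{0}\,\phi(y_i;\ \gamma_i,\ \Sigma_{0})
          + \pi_{+}\,\phi(y_i;\ \mu_{+},\ \Sigma_{+}),
  \qquad \pi_{-} + \pi_{0} + \pi_{+} = 1, \\
  % concordance constraints: every coordinate of the component mean has the same sign
  \mu_{-} &< 0 < \mu_{+} \quad \text{(elementwise)}, \\
  % penalized log-likelihood: most shifts gamma_i are shrunk exactly to zero
  (\hat\theta, \hat\gamma) &= \arg\max_{\theta,\,\gamma}\
      \sum_{i=1}^{n} \log f(y_i) \;-\; \lambda \sum_{i=1}^{n} \lVert \gamma_i \rVert_2, \\
  % eigenvalue-ratio condition keeping the likelihood bounded
  &\text{subject to}\quad
      \frac{\max_{k,j}\ \lambda_j(\Sigma_k)}{\min_{k,j}\ \lambda_j(\Sigma_k)} \le c,
      \qquad k \in \{-,\,0,\,+\}.
\end{align*}
```

Under this sketch, a protein with a nonzero estimated shift would be read as a candidate protein of discordant change: it is absorbed by the reference component instead of pulling the two concordant components away from their biologically meaningful positions.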