Biostatistics, Boston University, Boston, MA, USA.
Orthopaedic Surgery, Boston University, Boston, MA, USA.
Bioinformatics. 2019 Mar 1;35(5):778-786. doi: 10.1093/bioinformatics/bty696.
Clustering algorithms such as K-Means and standard Gaussian mixture models (GMM) fail to account for the structure of variability in replicated data or in repeated measures over time. In addition, having to assume the number of clusters a priori adds complexity to the process. Existing methods for optimizing cluster labels and the number of clusters can be inaccurate or computationally intensive for temporal gene expression data with this additional variability.
An extension to a model-based clustering algorithm is proposed that uses mixtures of mixed effects polynomial regression models and the expectation-maximization (EM) algorithm with an entropy penalized log-likelihood function (EPEM). The EPEM is used to cluster temporal gene expression data with this additional variability. Adding random effects to the model decreased the misclassification error relative to mixtures of fixed effects models and to other methods such as K-Means and GMM. Applying the method to microarray data from a fracture healing study revealed distinct temporal patterns of gene expression.
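To make the idea of an entropy penalized EM over polynomial time curves more concrete, the following is a minimal sketch and not the authors' implementation. It rests on several assumptions that do not come from the abstract: the random effects are omitted (each cluster is a fixed effects polynomial curve with its own noise variance), the entropy penalty is applied only to the mixing proportions using one commonly used update form, the penalty weight `beta` is held fixed, and components whose proportion falls below 1/n are pruned. The function name `fit_entropy_penalized_em`, its parameters, and the evenly spaced initialization are all hypothetical choices made for illustration.

```python
import numpy as np

def _responsibilities(Y, X, coef, pi, s2):
    """E-step: posterior probability of each component for each gene."""
    m = Y.shape[1]
    resid = Y[:, None, :] - (coef @ X.T)[None, :, :]      # (n, K, m)
    sq = (resid ** 2).sum(axis=2)                         # (n, K)
    logp = np.log(pi) - 0.5 * m * np.log(2 * np.pi * s2) - 0.5 * sq / s2
    logp -= logp.max(axis=1, keepdims=True)               # numerical stability
    R = np.exp(logp)
    return R / R.sum(axis=1, keepdims=True)

def fit_entropy_penalized_em(Y, t, degree=2, k_init=6, beta=0.5, n_iter=100):
    """Y: (n_genes, n_obs) expression values; t: (n_obs,) time of each column.
    Replicates are represented by repeating a time value in t.
    Assumption: fixed effects only, fixed penalty weight beta."""
    Y = np.asarray(Y, dtype=float)
    n, m = Y.shape
    assert k_init < n, "start with fewer components than genes"
    X = np.vander(np.asarray(t, dtype=float), degree + 1, increasing=True)
    P = np.linalg.pinv(X)                 # least-squares projector, (degree+1, m)

    # Deliberately start with too many components; seed each from one gene.
    idx = np.linspace(0, n - 1, k_init).astype(int)
    coef = Y[idx] @ P.T                   # (K, degree+1)
    pi = np.full(k_init, 1.0 / k_init)
    s2 = np.full(k_init, Y.var())

    for _ in range(n_iter):
        R = _responsibilities(Y, X, coef, pi, s2)
        Nk = np.maximum(R.sum(axis=0), 1e-12)
        # M-step: weighted least squares reduces to fitting the weighted mean curve,
        # because every gene shares the same design matrix X.
        coef = ((R.T @ Y) / Nk[:, None]) @ P.T
        sq = ((Y[:, None, :] - (coef @ X.T)[None, :, :]) ** 2).sum(axis=2)
        s2 = np.maximum((R * sq).sum(axis=0) / (m * Nk), 1e-8)
        # Entropy penalized proportions: small components shrink, large ones grow.
        pi_new = Nk / n + beta * pi * (np.log(pi) - np.sum(pi * np.log(pi)))
        keep = pi_new > 1.0 / n           # prune components that became negligible
        pi = pi_new[keep] / pi_new[keep].sum()
        coef, s2 = coef[keep], s2[keep]

    labels = _responsibilities(Y, X, coef, pi, s2).argmax(axis=1)
    return pi, coef, s2, labels
```

In this sketch the number of clusters is not fixed in advance: the fit starts from `k_init` components and lets the penalty remove redundant ones. Extending it toward the model described above would mean adding gene-level random effects to each component, which changes both the E-step likelihood and the M-step updates.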
https://github.com/darlenelu72/EPEM-GMM.
Supplementary data are available at Bioinformatics online.