Wang Meng, Zhou Xiaobo, King Randy W, Wong Stephen T C
Center for Bioinformatics, Harvard Center for Neurodegeneration and Repair, Harvard Medical School, 3rd floor, 1249 Boylston, Boston, MA 02215, USA.
BMC Bioinformatics. 2007 Jan 30;8:32. doi: 10.1186/1471-2105-8-32.
Automated identification of the cell cycle phases of individual live cells in a large population, captured via automated fluorescence microscopy, is important for cancer drug discovery and cell cycle studies. Time-lapse fluorescence microscopy images provide an important means of studying the cell cycle under different perturbation conditions. Existing methods are limited in handling such time-lapse data sets, and manual analysis is not feasible. This paper applies statistical data analysis and statistical pattern recognition to this task.
The data are generated from HeLa H2B-GFP cells imaged over a 2-day period, with images acquired 15 minutes apart using an automated time-lapse fluorescence microscope. The patterns are described with four kinds of features: twelve general features, Haralick texture features, Zernike moment features, and wavelet features. To generate a new feature set with more discriminative power, commonly used feature reduction techniques are applied: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Maximum Margin Criterion (MMC), Stepwise Discriminant Analysis based Feature Selection (SDAFS), and Genetic Algorithm based Feature Selection (GAFS). We then propose a Context Based Mixture Model (CBMM) for handling the time-series cell sequence information and compare it to traditional classifiers: Support Vector Machine (SVM), Neural Network (NN), and K-Nearest Neighbor (KNN). Following standard practice in machine learning, we systematically compare a number of common feature reduction techniques and classifiers to select an optimal combination of the two. A cellular database containing 100 manually labelled subsequences is built for evaluating the performance of the classifiers. The generalization error is estimated using cross validation. The experimental results show that CBMM outperforms all other classifiers in identifying prophase and has the best overall performance.
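The selection procedure described above (estimate the cross-validated error of each feature set and classifier pairing, then keep the best) can be illustrated with a small sketch. This is not the authors' pipeline: the classifier is a plain 1-nearest-neighbor rule, "feature reduction" is reduced to picking columns, and the helper names (`nearest_neighbor_label`, `cross_val_error`, `select_columns`) and the toy data are hypothetical. The data are deliberately contrived so that the second feature carries no class information and misleads the distance computation.

```python
import math

def nearest_neighbor_label(train_X, train_y, x):
    """Return the label of the training point closest to x (Euclidean distance)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]

def cross_val_error(X, y, n_folds=5):
    """Pooled misclassification rate over all held-out samples in k-fold CV."""
    n = len(X)
    wrong = 0
    for fold in range(n_folds):
        # Sample i belongs to fold (i mod n_folds); that fold is held out.
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        wrong += sum(
            nearest_neighbor_label(train_X, train_y, X[i]) != y[i] for i in test
        )
    return wrong / n

def select_columns(X, columns):
    """Keep only the listed feature columns (a stand-in for feature selection)."""
    return [[row[c] for c in columns] for row in X]

# Toy data (assumption): feature 0 separates the classes, feature 1 is a
# large-scale nuisance value shared by one sample of each class.
X, y = [], []
for i in range(10):
    X.append([0.0, 10.0 * i]); y.append("interphase")
    X.append([1.0, 10.0 * i]); y.append("prophase")

candidates = {"all features": [0, 1], "feature 0 only": [0]}
errors = {
    name: cross_val_error(select_columns(X, cols), y)
    for name, cols in candidates.items()
}
best = min(errors, key=errors.get)
print(errors, "->", best)
```

On this toy set the nuisance feature dominates the distance, so dropping it lowers the estimated error. Real feature reduction (PCA, LDA, and the others named above) transforms or searches the feature space rather than hand-picking a column.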
The application of feature reduction techniques can improve the prediction accuracy significantly. CBMM can effectively utilize the contextual information and has the best overall performance when combined with any of the previously mentioned feature reduction techniques.
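To give a sense of why contextual information can help, the sketch below decodes a sequence of per-frame class probabilities with a generic Viterbi pass that only allows a cell to stay in its phase or advance to the next one. This is explicitly not the CBMM from the paper, whose formulation is not given in this abstract. The four-phase list, the cyclic ordering, the `stay` probability, and the example probabilities are all assumptions made for illustration.

```python
import math

# Assumed phase order; a cell may stay in a phase or move to the next (cyclically).
PHASES = ["interphase", "prophase", "metaphase", "anaphase"]

def safe_log(p):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else -math.inf

def viterbi_phases(frame_probs, stay=0.8):
    """Most likely phase sequence under the stay-or-advance transition rule."""
    n = len(PHASES)

    def log_trans(prev, cur):
        if cur == prev:
            return math.log(stay)
        if cur == (prev + 1) % n:
            return math.log(1.0 - stay)
        return -math.inf  # skipping or reversing a phase is disallowed

    score = [safe_log(p) for p in frame_probs[0]]
    back = []  # back[t][c] = best predecessor of state c at frame t + 1
    for probs in frame_probs[1:]:
        new_score, pointers = [], []
        for cur in range(n):
            prev = max(range(n), key=lambda p: score[p] + log_trans(p, cur))
            pointers.append(prev)
            new_score.append(score[prev] + log_trans(prev, cur) + safe_log(probs[cur]))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [PHASES[s] for s in path]

# Hypothetical per-frame classifier outputs for one cell over five frames.
frames = [
    [0.90, 0.05, 0.03, 0.02],
    [0.90, 0.05, 0.03, 0.02],
    [0.50, 0.40, 0.05, 0.05],  # prophase is easily confused with interphase
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.10, 0.80],
]
independent = [PHASES[max(range(len(PHASES)), key=lambda s: p[s])] for p in frames]
smoothed = viterbi_phases(frames)
print(independent)
print(smoothed)
```

Classifying each frame independently labels the third frame as interphase, which implies an impossible jump from interphase straight to metaphase. The sequence-aware decoding recovers prophase for that frame because the later metaphase observation makes it the only consistent choice.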