Sigdel Madhav, Dinc Imren, Sigdel Madhu S, Dinc Semih, Pusey Marc L, Aygun Ramazan S
Computer Science Department, University of Alabama in Huntsville, Huntsville, 35899 Alabama USA.
Computer Science Department, Troy University, Troy, 36082 Alabama USA.
BioData Min. 2017 Apr 27;10:14. doi: 10.1186/s13040-017-0133-9. eCollection 2017.
A large number of features are extracted from protein crystallization trial images to improve the accuracy of classifiers that predict the presence of crystals or the phases of the crystallization process. The excessive number of features, together with the computationally intensive image processing methods needed to extract them, makes automated classification tools inconvenient to use on stand-alone computing systems because of the time required to complete the classification tasks. This study investigates combinations of image feature sets, feature reduction, and classification techniques for crystallization images that benefit from trace fluorescence labeling.
Features are categorized into intensity, graph, histogram, texture, shape adaptive, and region features (using binarized images generated by Otsu's, green percentile, and morphological thresholding). The effects of normalization, feature reduction with principal component analysis (PCA), and feature selection using a random forest classifier are also analyzed. The time required to extract each feature category is computed, and an estimated extraction time is provided for feature category combinations. We conducted around 8624 experiments (different combinations of feature categories, binarization methods, feature reduction/selection, normalization, and crystal categories). The best experimental results were obtained using a combination of intensity features, region features using Otsu's thresholding, region features using green percentile thresholding, graph features, and histogram features. Using this feature set combination, 96% accuracy (without misclassifying crystals as non-crystals) was achieved for the first level of classification, which determines the presence of crystals. Since missing a crystal is undesirable, our algorithm is adjusted to achieve a high sensitivity rate. In the second-level classification, 74.2% accuracy was achieved for (5-class) crystal sub-category classification. The best classification rates were achieved using a random forest classifier.
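As an illustration of two of the binarization methods named above, the following is a minimal sketch of Otsu's thresholding and a percentile-based threshold on the green channel, written in plain Python. It is not the authors' implementation: the function names, the flat pixel-list input, the nearest-rank percentile definition, and the convention that values strictly above the threshold are foreground are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form the background class, the rest the foreground.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of background pixels so far
    sum_bg = 0    # sum of background intensities so far
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # no foreground pixels left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def percentile_threshold(green: Sequence[int], percentile: float) -> int:
    """Return the nearest-rank percentile of the green-channel values.

    Assumption: 'green percentile thresholding' keeps pixels whose green
    intensity lies above the chosen percentile of the image.
    """
    ordered = sorted(green)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def binarize(values: Sequence[int], threshold: int) -> List[int]:
    """Mark values strictly above the threshold as foreground (1)."""
    return [1 if v > threshold else 0 for v in values]
```

For example, `binarize(green, percentile_threshold(green, 90))` would keep roughly the brightest 10% of green-channel pixels as candidate regions, from which region features could then be computed.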
Feature extraction and classification could be completed in about 2 s per image on a stand-alone computing system, which is suitable for real-time analysis. These results enable research groups to select features according to their hardware setups for real-time analysis.
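Selecting features to fit a hardware setup can be sketched as a search over feature category combinations under a per-image time budget. The sketch below is hypothetical: the function name is invented for illustration, extraction times are assumed to be additive across categories, and the timing values in the usage note are placeholders rather than measurements from the study.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def combos_within_budget(
    times_s: Dict[str, float], budget_s: float
) -> List[Tuple[Tuple[str, ...], float]]:
    """List feature category combinations whose summed time fits the budget.

    Assumption: the extraction time of a combination is the sum of the
    per-category extraction times.
    """
    names = sorted(times_s)
    feasible = []
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            total = sum(times_s[name] for name in combo)
            if total <= budget_s:
                feasible.append((combo, total))
    # Prefer combinations with more categories, then the faster ones.
    feasible.sort(key=lambda item: (-len(item[0]), item[1]))
    return feasible
```

With placeholder timings such as `{"intensity": 0.25, "histogram": 0.5, "texture": 1.5}` and a 1 s budget, the first entry returned is the pair `("histogram", "intensity")`. In practice, each feasible combination would still need to be ranked by the classification accuracy it achieves.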