Lin Chung-Chih, Tsai Yuh-Show, Lin Yu-Shi, Chiu Tai-Yu, Hsiung Chia-Cheng, Lee May-I, Simpson Jeremy C, Hsu Chun-Nan
Faculty of Life Sciences and Institute of Genomes, National Yang-Ming University, Taipei, Taiwan.
Bioinformatics. 2007 Dec 15;23(24):3374-81. doi: 10.1093/bioinformatics/btm497. Epub 2007 Oct 22.
Determining where a protein is expressed is essential to understanding its function. Advances in green fluorescent protein (GFP) fusion proteins and automated fluorescence microscopy allow rapid acquisition of large collections of protein localization images. Recognizing these cell images requires an automated image analysis system. Previous work concentrated on designing a set of optimal features and then applying standard machine-learning algorithms. Two recent trends in machine learning and computer vision can be applied to improve performance. One is multiclass learning with error-correcting output codes (ECOC). The other is the use of a large number of weak detectors with boosting to detect objects in images of real-world scenes.
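As a rough illustration of the ECOC idea mentioned above, the following minimal sketch assigns each class a binary codeword, lets one binary classifier predict each bit, and decodes by nearest Hamming distance. The 4-class, 7-bit code matrix and all function names are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

# Hypothetical code matrix: one row per class, one column per binary subproblem.
# Any two rows differ in exactly 4 bits, so a single wrong bit can be corrected.
CODE_MATRIX = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 0, 1],
])


def ecoc_encode(labels, code_matrix):
    """Map class labels to codewords; column j gives the targets for binary classifier j."""
    return code_matrix[np.asarray(labels)]


def ecoc_decode(bits, code_matrix):
    """Return, for each predicted bit vector, the class whose codeword is nearest in Hamming distance."""
    bits = np.atleast_2d(bits)
    dists = (bits[:, None, :] != code_matrix[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)


def min_hamming_distance(code_matrix):
    """Smallest pairwise Hamming distance between codewords."""
    k = len(code_matrix)
    return min(int((code_matrix[a] != code_matrix[b]).sum())
               for a in range(k) for b in range(a + 1, k))


if __name__ == "__main__":
    # Simulate one binary classifier being wrong for a class-2 sample.
    noisy = CODE_MATRIX[2].copy()
    noisy[3] ^= 1  # flip one bit
    print("decoded class:", ecoc_decode(noisy, CODE_MATRIX)[0])
```

With a minimum distance of d between codewords, nearest-codeword decoding tolerates up to floor((d - 1) / 2) wrong binary predictions, which is the sense in which the codes are error-correcting.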
We take advantage of these advances to propose a new learning algorithm, AdaBoost.ERC, coupled with weak and strong detectors, to improve the automatic recognition of protein subcellular locations in cell images. To evaluate the method, we prepared two image data sets of CHO and Vero cells and downloaded a publicly available HeLa cell image data set. We show that AdaBoost.ERC outperforms other AdaBoost extensions. We demonstrate the benefit of weak detectors by showing significant performance improvements over classifiers that use only strong detectors. We also empirically test our method's ability to generalize to heterogeneous image collections. Compared with previous work, our method performs reasonably well on the HeLa cell images.
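The abstract does not spell out AdaBoost.ERC, so the sketch below is not that algorithm. It shows only the standard binary (discrete) AdaBoost that such extensions build on, with one-feature threshold stumps standing in for weak detectors. The function names and the toy data are assumptions made for illustration.

```python
import numpy as np


def stump_predict(X, j, thr, pol):
    """Weak detector: predict +1 where pol * (X[:, j] - thr) >= 0, else -1."""
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)


def fit_stump(X, y, w):
    """Exhaustively pick the stump (feature, threshold, polarity) with lowest weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best


def adaboost_fit(X, y, rounds=10):
    """Discrete AdaBoost for labels in {-1, +1}; returns a list of (alpha, j, thr, pol)."""
    w = np.full(len(y), 1.0 / len(y))  # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        j, thr, pol, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)  # avoid division by zero and log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this stump
        pred = stump_predict(X, j, thr, pol)
        w = w * np.exp(-alpha * y * pred)  # up-weight the samples this stump got wrong
        w = w / w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble


def adaboost_predict(X, ensemble):
    """Sign of the alpha-weighted vote over all stumps."""
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)


if __name__ == "__main__":
    # Toy 1-D data: the negative class sits in the middle, so no single stump separates it.
    X = np.arange(6, dtype=float).reshape(-1, 1)
    y = np.array([1, 1, -1, -1, 1, 1])
    for rounds in (1, 3):
        acc = (adaboost_predict(X, adaboost_fit(X, y, rounds)) == y).mean()
        print(f"rounds={rounds} training accuracy={acc:.2f}")
```

On this toy data a single stump cannot separate the classes, while three boosted stumps classify every training point correctly. A multiclass variant could train one such booster per column of an ECOC code matrix, though how AdaBoost.ERC actually combines the two is not stated here.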
CHO and Vero cell images, their corresponding feature sets (SSLF and WSLF), our new learning algorithm, AdaBoost.ERC, and Supplementary Material are available at http://aiia.iis.sinica.edu.tw/