Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
Bioinformatics. 2013 Sep 15;29(18):2343-9. doi: 10.1093/bioinformatics/btt392. Epub 2013 Jul 8.
Previous systems for automated determination of subcellular location from microscope images have been evaluated on datasets in which each location class consisted of multiple images of the same representative protein. Here, we frame a more challenging and useful problem, in which previously unseen proteins are to be classified.
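One way to picture this problem setup is an evaluation split in which every image of a given protein is held out together, so that the classifier is always tested on a protein it never saw during training. The following is a minimal sketch of such a split, not the evaluation code used for this work; the function name `protein_holdout_splits` and the toy protein and location labels are hypothetical.

```python
from collections import defaultdict


def protein_holdout_splits(proteins):
    """Yield (held_out, train_idx, test_idx), holding out one protein at a time.

    `proteins` lists the protein name of each image, in image order.
    """
    by_protein = defaultdict(list)
    for idx, name in enumerate(proteins):
        by_protein[name].append(idx)
    for held_out in sorted(by_protein):
        test_idx = by_protein[held_out]
        train_idx = [i for i, name in enumerate(proteins) if name != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical toy metadata: one entry per image.
proteins = ["protA", "protA", "protB", "protC", "protC", "protD"]
locations = ["nucleus", "nucleus", "nucleus", "golgi", "golgi", "golgi"]

splits = list(protein_holdout_splits(proteins))
for held_out, train_idx, test_idx in splits:
    # No image of the held-out protein may appear in the training set.
    assert held_out not in {proteins[i] for i in train_idx}
    print(held_out, "train:", train_idx, "test:", test_idx)
```

A split of this kind only makes sense when each location class contains at least two proteins; otherwise holding out a protein removes its whole class from the training set.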
Using CD-tagging, we generated two new image datasets for evaluation of this problem, each containing several different proteins per location class. Evaluation of previous methods on these new datasets showed that it is much harder to train a classifier that generalizes across different proteins than one that simply recognizes a protein it was trained on. We therefore developed and evaluated additional approaches, incorporating novel modifications of local feature techniques. These extend the notion of local features to exploit both the protein image and any reference markers that were imaged in parallel. With these, we obtained a large accuracy improvement over existing methods on our new datasets. These features also improved classification on other, previously studied datasets.
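The idea of describing each local region using both the protein channel and a reference channel can be illustrated as follows. This is a sketch under stated assumptions, not the feature computation used for this work: the per-patch descriptor (mean and standard deviation) is a deliberately simple stand-in, a regular grid replaces any interest-point detection, and the names `patch_stats` and `joint_local_features` are hypothetical. It assumes NumPy is installed.

```python
import numpy as np


def patch_stats(patch):
    """Stand-in descriptor (an assumption): mean and standard deviation."""
    return np.array([patch.mean(), patch.std()])


def joint_local_features(protein, reference, patch=8, step=8):
    """Describe each grid location with both channels, concatenated.

    Returns an array with one row per patch: two values from the protein
    channel followed by two values from the reference channel.
    """
    if protein.shape != reference.shape:
        raise ValueError("channels must have the same shape")
    rows, cols = protein.shape
    descriptors = []
    for r in range(0, rows - patch + 1, step):
        for c in range(0, cols - patch + 1, step):
            p = protein[r:r + patch, c:c + patch]
            q = reference[r:r + patch, c:c + patch]
            descriptors.append(np.concatenate([patch_stats(p), patch_stats(q)]))
    return np.array(descriptors)


# Synthetic two-channel image, used only to exercise the function.
rng = np.random.default_rng(0)
protein_img = rng.random((32, 32))
reference_img = rng.random((32, 32))

features = joint_local_features(protein_img, reference_img)
print(features.shape)
```

Because both halves of each row come from the same location, a downstream classifier can relate the protein pattern to the reference marker at that location rather than treating the two channels independently.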
The datasets are available for download at http://murphylab.web.cmu.edu/data/. The software was written in Python and C++ and is available under an open-source license at http://murphylab.web.cmu.edu/software/. The code is split into a library, which can be easily reused for other data, and a small driver script for reproducing all results presented here. A step-by-step tutorial on applying the methods to new datasets is also available at that address.
Supplementary data are available at Bioinformatics online.