Xiao Mengli, Shen Xiaotong, Pan Wei
Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota.
School of Statistics, University of Minnesota, Minneapolis, Minnesota.
Genet Epidemiol. 2019 Apr;43(3):330-341. doi: 10.1002/gepi.22182. Epub 2019 Jan 4.
Single-cell microscopy image analysis has proved invaluable for determining protein subcellular localization, which in turn helps infer gene/protein function. Fluorescently tagged proteins are tracked and imaged across cellular compartments in response to genetic or environmental perturbations. Because high-content microscopy generates a large number of images and manual labeling is both labor-intensive and error-prone, machine learning offers a viable alternative for automatic labeling of subcellular localizations. Meanwhile, in recent years deep learning methods have been applied quite successfully to large datasets of natural images and in other domains. An appeal of deep learning methods is that they can learn salient features from complicated data with little data preprocessing. To this end, we applied several representative types of deep convolutional neural networks (CNNs) and two popular ensemble methods, random forests and gradient boosting, to predict protein subcellular localization with a moderately large cell image dataset. We show consistently better predictive performance of CNNs over the two ensemble methods. We also demonstrate the use of CNNs for feature extraction. Finally, we share our computer code and pretrained models to facilitate applications of CNNs in genetics and computational biology.