Smith Kevin, Horvath Peter
Light Microscopy and Screening Centre, ETH Zurich, Switzerland.
Institute of Biochemistry, ETH Zurich, Switzerland; Synthetic and Systems Biology Unit, Biological Research Center, Szeged, Hungary
J Biomol Screen. 2014 Jun;19(5):685-95. doi: 10.1177/1087057114527313. Epub 2014 Mar 18.
High-content screening is a powerful method to discover new drugs and carry out basic biological research. Increasingly, high-content screens have come to rely on supervised machine learning (SML) to perform automatic phenotypic classification as an essential step of the analysis. However, this comes at a cost, namely, the labeled examples required to train the predictive model. Classification performance increases with the number of labeled examples, and because labeling examples demands time from an expert, the training process represents a significant time investment. Active learning strategies attempt to overcome this bottleneck by presenting the most relevant examples to the annotator, thereby achieving high accuracy while minimizing the cost of obtaining labeled data. In this article, we investigate the impact of active learning on single-cell-based phenotype recognition, using data from three large-scale RNA interference high-content screens representing diverse phenotypic profiling problems. We consider several combinations of active learning strategies and popular SML methods. Our results show that active learning significantly reduces the time cost and can be used to reveal the same phenotypic targets identified using SML. We also identify combinations of active learning strategies and SML methods which perform better than others on the phenotypic profiling problems we studied.
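The idea of presenting the most relevant examples to the annotator can be illustrated with a minimal pool-based sketch. The abstract does not name the active learning strategies or SML methods that were evaluated, so everything here is an assumption chosen for brevity: a nearest-centroid classifier stands in for the SML method, margin-based uncertainty sampling stands in for the query strategy, and `oracle` is a hypothetical stand-in for the expert who labels a cell's feature vector.

```python
import math


def fit_centroids(points, labels):
    """Return the per-class mean of the labeled feature vectors."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, p):
    """Assign p to the class whose centroid is nearest (assumed classifier)."""
    return min(centroids, key=lambda y: math.dist(p, centroids[y]))


def margin(centroids, p):
    """Gap between the two smallest centroid distances; a small gap means
    the classifier is uncertain about p. Needs at least two classes."""
    d = sorted(math.dist(p, c) for c in centroids.values())
    return d[1] - d[0]


def active_learn(pool, oracle, seed_idx, budget):
    """Pool-based active learning with uncertainty sampling (assumed strategy).

    pool     -- list of feature vectors, one per cell
    oracle   -- hypothetical annotator: feature vector -> class label
    seed_idx -- indices labeled up front; must cover at least two classes
    budget   -- number of additional labels to request
    """
    labeled = {i: oracle(pool[i]) for i in seed_idx}

    def refit():
        idx = list(labeled)
        return fit_centroids([pool[i] for i in idx], [labeled[i] for i in idx])

    for _ in range(budget):
        centroids = refit()
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        # Query the unlabeled cell the current model is least sure about.
        query = min(candidates, key=lambda i: margin(centroids, pool[i]))
        labeled[query] = oracle(pool[query])
    return refit(), labeled


if __name__ == "__main__":
    # Toy pool with one hypothetical feature per cell; not data from the screens.
    pool = [[-1 + 2 * i / 40] for i in range(41)]
    model, labeled = active_learn(pool, lambda p: int(p[0] > 0), [0, 40], 5)
    print(sorted(labeled), predict(model, [0.3]))
```

In this sketch each round spends the annotator's time on the cell closest to the current decision boundary, which is the intuition behind reducing labeling cost; a random-sampling baseline would replace the `min(...)` query with a random choice from `candidates`.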