Combining Supervised and Unsupervised Machine Learning Methods for Phenotypic Functional Genomics Screening.

Affiliations

Department of Cell Biology, Centre for Molecular Medicine, UMC Utrecht, Utrecht, The Netherlands.

Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands.

Publication information

SLAS Discov. 2020 Jul;25(6):655-664. doi: 10.1177/2472555220919345. Epub 2020 May 13.

DOI:10.1177/2472555220919345

PMID:32400262

Abstract

There has been an increase in the use of machine learning and artificial intelligence (AI) for the analysis of image-based cellular screens. The accuracy of these analyses, however, is greatly dependent on the quality of the training sets used for building the machine learning models. We propose that unsupervised exploratory methods should first be applied to the data set to gain a better insight into the quality of the data. This improves the selection and labeling of data for creating training sets before the application of machine learning. We demonstrate this using a high-content genome-wide small interfering RNA screen. We perform an unsupervised exploratory data analysis to facilitate the identification of four robust phenotypes, which we subsequently use as a training set for building a high-quality random forest machine learning model to differentiate four phenotypes with an accuracy of 91.1% and a kappa of 0.85. Our approach enhanced our ability to extract new knowledge from the screen when compared with the use of unsupervised methods alone.
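
The two figures reported above, overall accuracy and Cohen's kappa, can both be derived from a confusion matrix. The following is a minimal sketch of that calculation for a four-class problem. The function name `accuracy_and_kappa` and the counts in `confusion` are hypothetical and chosen only for illustration; they are not data from the screen and are not meant to reproduce the reported values.

```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)

    # Observed agreement: fraction of samples on the diagonal (accuracy).
    observed = sum(confusion[i][i] for i in range(k)) / n

    # Agreement expected by chance, from the row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)

    # Kappa rescales accuracy so that 0 means chance-level agreement.
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts for four phenotype classes (NOT data from the paper).
confusion = [
    [45, 2, 2, 1],
    [3, 44, 2, 1],
    [1, 2, 46, 1],
    [2, 1, 1, 46],
]

acc, kappa = accuracy_and_kappa(confusion)
print(f"accuracy = {acc:.3f}, kappa = {kappa:.3f}")
```

Kappa is lower than accuracy because it discounts the agreement a classifier would reach by chance given the class proportions, which is why the two metrics are usually reported together for multi-class models.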
