Filzen Tracey M, Kutchukian Peter S, Hermes Jeffrey D, Li Jing, Tudor Matthew
Medical Writing, Merck Research Laboratories, Upper Gwynedd, Pennsylvania, United States of America.
Informatics, Merck Research Laboratories, Boston, Massachusetts, United States of America.
PLoS Comput Biol. 2017 Feb 9;13(2):e1005335. doi: 10.1371/journal.pcbi.1005335. eCollection 2017 Feb.
High-throughput mRNA expression profiling can be used to characterize the response of cell culture models to pharmacologic and genetic perturbations. As profiling campaigns expand in scope, it is important to homogenize, summarize, and analyze the resulting data in a manner that captures significant biological signals in spite of noise sources such as batch effects and stochastic variation. We used the L1000 platform for large-scale profiling of 978 representative (landmark) genes across thousands of compound treatments. Here we describe a method that uses deep learning techniques to convert the expression changes of the landmark genes into a perturbation barcode that reveals important features of the underlying data and yields biological insights more readily than the raw data do. The barcode captures compound structure and target information, and predicts a compound's high-throughput screening promiscuity, to a higher degree than the original data measurements, indicating that the approach uncovers underlying factors of the expression data that are otherwise entangled or masked by noise. Furthermore, we demonstrate that visualizations derived from the perturbation barcode can be used to assign functions to unknown compounds more sensitively through a guilt-by-association approach, which we use to predict and experimentally validate the activity of compounds on the MAPK pathway. This application of deep metric learning to large-scale chemical genetics projects highlights the utility of this and related approaches for extracting insights and testable hypotheses from large, sometimes noisy datasets.
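To make the barcode idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary barcode, a hypothetical length of 32 bits, and a small untrained random network standing in for the learned deep metric model. None of these details are stated in the abstract; only the 978 landmark genes are. The helper names `make_encoder`, `encode`, `hamming_distances`, and `nearest_annotated` are illustrative.

```python
import numpy as np

N_LANDMARK_GENES = 978  # number of landmark genes, from the abstract
BARCODE_BITS = 32       # hypothetical barcode length, chosen for illustration


def make_encoder(rng, n_in=N_LANDMARK_GENES, n_hidden=128, n_out=BARCODE_BITS):
    """Random weights for a tiny two-layer network.

    Assumption: this is an untrained stand-in for a learned metric model.
    """
    return {
        "w1": rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_hidden)),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=(n_hidden, n_out)),
    }


def encode(params, profiles):
    """Map expression-change profiles of shape (n_samples, 978) to 0/1 barcodes."""
    hidden = np.tanh(profiles @ params["w1"])
    logits = hidden @ params["w2"]
    # Assumption: the barcode is obtained by thresholding each output at zero.
    return (logits > 0).astype(np.uint8)


def hamming_distances(query, barcodes):
    """Count differing bits between one barcode and each row of `barcodes`."""
    return np.count_nonzero(barcodes != query, axis=1)


def nearest_annotated(query, barcodes, labels, k=3):
    """Guilt by association: labels of the k barcodes closest to `query`."""
    order = np.argsort(hamming_distances(query, barcodes), kind="stable")[:k]
    return [labels[i] for i in order]
```

In this sketch, an unannotated compound would be encoded with `encode` and passed to `nearest_annotated` together with the barcodes and labels of annotated compounds. With a trained model, neighbors in barcode space would be expected to share structure or targets; with the random weights used here the neighbors carry no biological meaning.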