Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
Department of Biology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.
J Chem Inf Model. 2022 May 23;62(10):2316-2331. doi: 10.1021/acs.jcim.2c00041. Epub 2022 May 10.
DNA-encoded library (DEL) screening and quantitative structure-activity relationship (QSAR) modeling are two techniques used in drug discovery to find novel small molecules that bind a protein target. Applying QSAR modeling to DEL selection data can facilitate the selection of compounds for off-DNA synthesis and evaluation. Such a combined approach has been done recently by training binary classifiers to learn DEL enrichments of aggregated "disynthons" in order to accommodate the sparse and noisy nature of DEL data. However, a binary classification model cannot distinguish between different levels of enrichment, and information is potentially lost during disynthon aggregation. Here, we demonstrate a regression approach to learning DEL enrichments of individual molecules, using a custom negative-log-likelihood loss function that effectively denoises DEL data and introduces opportunities for visualization of learned structure-activity relationships. Our approach explicitly models the Poisson statistics of the sequencing process used in the DEL experimental workflow under a frequentist view. We illustrate this approach on a DEL dataset of 108,528 compounds screened against carbonic anhydrase (CAIX), and a dataset of 5,655,000 compounds screened against soluble epoxide hydrolase (sEH) and SIRT2. Due to the treatment of uncertainty in the data through the negative-log-likelihood loss used during training, the models can ignore low-confidence outliers. While our approach does not demonstrate a benefit for extrapolation to novel structures, we expect our denoising and visualization pipeline to be useful in identifying structure-activity trends and highly enriched pharmacophores in DEL data. 
Further, this approach to uncertainty-aware regression modeling is applicable to other sparse or noisy datasets where the nature of stochasticity is known or can be modeled; in particular, the Poisson enrichment ratio metric we use can apply to other settings that compare sequencing count data between two experimental conditions.
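As an illustration of comparing sequencing counts between two conditions under Poisson statistics, the sketch below uses a standard textbook result rather than the loss function from the paper itself: if the target and control counts are independent Poisson variables, then conditional on their sum the target count is binomial, which gives a closed-form negative log-likelihood for the rate ratio. The function names (`softplus`, `enrichment_mle`, `ratio_nll`) and the read-depth arguments `n_target` and `n_control` are hypothetical choices made for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def enrichment_mle(k_target: int, k_control: int,
                   n_target: float, n_control: float) -> float:
    """Maximum-likelihood ratio of two Poisson rates, normalized by read depth.

    Assumes k_control > 0; zero control counts need separate handling.
    """
    return (k_target / n_target) / (k_control / n_control)


def ratio_nll(log_ratio: float, k_target: int, k_control: int,
              n_target: float, n_control: float) -> float:
    """Negative log-likelihood of a predicted log enrichment ratio.

    Conditional on k_target + k_control, k_target is binomial with
    p = R * n_target / (R * n_target + n_control), i.e.
    logit(p) = log R + log(n_target / n_control).
    The binomial coefficient is dropped because it does not depend on R.
    """
    logit = log_ratio + math.log(n_target / n_control)
    # -log p = softplus(-logit) and -log(1 - p) = softplus(logit)
    return k_target * softplus(-logit) + k_control * softplus(logit)


if __name__ == "__main__":
    depth = 1_000_000.0  # hypothetical total reads per condition
    # Same observed ratio (2x), very different confidence.
    for k_t, k_c in [(2, 1), (200, 100)]:
        best = math.log(enrichment_mle(k_t, k_c, depth, depth))
        penalty = ratio_nll(0.0, k_t, k_c, depth, depth) - ratio_nll(best, k_t, k_c, depth, depth)
        print(f"counts=({k_t}, {k_c})  penalty for predicting R=1: {penalty:.3f}")
```

Under these assumptions, a compound observed at 2 versus 1 reads penalizes a prediction of no enrichment far less than one observed at 200 versus 100 reads, even though both have the same observed ratio. This is the sense in which a likelihood-based loss can down-weight low-confidence outliers during training.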