Machine Learning on DNA-Encoded Library Count Data Using an Uncertainty-Aware Probabilistic Loss Function.

Author information

Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.

Department of Biology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.

Publication information

J Chem Inf Model. 2022 May 23;62(10):2316-2331. doi: 10.1021/acs.jcim.2c00041. Epub 2022 May 10.

DOI:10.1021/acs.jcim.2c00041

PMID:35535861

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10830332/

Abstract

DNA-encoded library (DEL) screening and quantitative structure-activity relationship (QSAR) modeling are two techniques used in drug discovery to find novel small molecules that bind a protein target. Applying QSAR modeling to DEL selection data can facilitate the selection of compounds for off-DNA synthesis and evaluation. Such a combined approach has recently been implemented by training binary classifiers to learn DEL enrichments of aggregated "disynthons" in order to accommodate the sparse and noisy nature of DEL data. However, a binary classification model cannot distinguish between different levels of enrichment, and information is potentially lost during disynthon aggregation. Here, we demonstrate a regression approach to learning DEL enrichments of individual molecules, using a custom negative-log-likelihood loss function that effectively denoises DEL data and introduces opportunities for visualization of learned structure-activity relationships. Our approach explicitly models the Poisson statistics of the sequencing process used in the DEL experimental workflow under a frequentist view. We illustrate this approach on a DEL dataset of 108,528 compounds screened against carbonic anhydrase IX (CAIX), and a dataset of 5,655,000 compounds screened against soluble epoxide hydrolase (sEH) and SIRT2. Because the negative-log-likelihood loss used during training accounts for uncertainty in the data, the models can ignore low-confidence outliers. While our approach does not demonstrate a benefit for extrapolation to novel structures, we expect our denoising and visualization pipeline to be useful in identifying structure-activity trends and highly enriched pharmacophores in DEL data. Further, this approach to uncertainty-aware regression modeling is applicable to other sparse or noisy datasets where the nature of stochasticity is known or can be modeled; in particular, the Poisson enrichment ratio metric we use can apply to other settings that compare sequencing count data between two experimental conditions.
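
The idea of scoring a predicted enrichment against Poisson count data can be illustrated with a small sketch. The abstract does not give the form of the loss, so the code below is not the authors' function. It substitutes a standard textbook construction: if the post-selection and control counts are independent Poisson variables, then conditional on their sum the post-selection count is binomial, with a success probability that depends only on the enrichment ratio and the two sequencing depths. All names (`enrichment_nll`, `mle_log_ratio`, `k_sel`, `k_ctrl`, `n_sel`, `n_ctrl`) are hypothetical and chosen for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def enrichment_nll(log_r: float, k_sel: int, k_ctrl: int,
                   n_sel: float, n_ctrl: float) -> float:
    """Negative log-likelihood of a predicted log enrichment ratio.

    Assumed model (illustrative, not taken from the paper):
      k_sel  ~ Poisson(R * lam * n_sel)   # post-selection count
      k_ctrl ~ Poisson(lam * n_ctrl)      # control count
    Conditional on k_sel + k_ctrl, k_sel is Binomial with
      p = R * n_sel / (R * n_sel + n_ctrl).
    """
    # logit(p) = log R + log(n_sel / n_ctrl)
    logit_p = log_r + math.log(n_sel / n_ctrl)
    log_p = -softplus(-logit_p)   # log p
    log_q = -softplus(logit_p)    # log (1 - p)
    log_binom = (math.lgamma(k_sel + k_ctrl + 1)
                 - math.lgamma(k_sel + 1)
                 - math.lgamma(k_ctrl + 1))
    return -(log_binom + k_sel * log_p + k_ctrl * log_q)


def mle_log_ratio(k_sel: int, k_ctrl: int,
                  n_sel: float, n_ctrl: float) -> float:
    """Maximum-likelihood log enrichment; requires both counts > 0."""
    return math.log((k_sel / n_sel) / (k_ctrl / n_ctrl))


if __name__ == "__main__":
    depth = 1_000_000  # hypothetical total reads per condition
    for k_sel, k_ctrl in [(3, 1), (300, 100)]:
        best = mle_log_ratio(k_sel, k_ctrl, depth, depth)
        # Extra loss incurred by predicting "no enrichment" (log R = 0)
        penalty = (enrichment_nll(0.0, k_sel, k_ctrl, depth, depth)
                   - enrichment_nll(best, k_sel, k_ctrl, depth, depth))
        print(f"counts=({k_sel}, {k_ctrl})  penalty={penalty:.3f}")
```

Both count pairs have the same observed ratio, but the penalty for disagreeing with it grows with the number of reads. This is one way a likelihood-based loss can let a model discount compounds whose apparent enrichment rests on only a few reads.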
