Hou Rui, Xie Chao, Gui Yuhan, Li Gang, Li Xiaoyu
Department of Chemistry and State Key Laboratory of Synthetic Chemistry, The University of Hong Kong, Hong Kong SAR, China.
Laboratory for Synthetic Chemistry and Chemical Biology Limited, Health@InnoHK, Innovation and Technology Commission, Hong Kong SAR, China.
ACS Omega. 2023 May 15;8(21):19057-19071. doi: 10.1021/acsomega.3c02152. eCollection 2023 May 30.
DNA-encoded library (DEL) technology is a powerful ligand discovery approach that has been widely adopted in the pharmaceutical industry. DEL selections are typically performed against a purified protein target, either immobilized on a matrix or in solution. Recently, DELs have also been used to interrogate targets in complex biological environments, such as membrane proteins on live cells. However, because the cell surface is a complex landscape, such selections inevitably involve significant nonspecific interactions, and the resulting data are much noisier than those obtained with purified proteins, which makes reliable hit identification highly challenging. Several approaches have been developed to denoise DEL datasets, but it remains unclear whether they are suitable for cell-based DEL selections. Here, we report a proof of principle for a new machine-learning (ML)-based approach to processing cell-based DEL selection datasets. The approach uses a Maximum A Posteriori (MAP) estimation loss function, a probabilistic framework that can account for and quantify the uncertainty in noisy data. We applied the approach to a DEL selection dataset in which a library of 7,721,415 compounds was selected against purified carbonic anhydrase 2 (CA-2) and against a cell line expressing the membrane protein carbonic anhydrase 12 (CA-12). An extended-connectivity fingerprint (ECFP)-based regression model trained with the MAP loss function identified true binders as well as reliable structure-activity relationships (SAR) from the noisy cell-based selection datasets. In addition, the regularized enrichment metric (known as MAP enrichment) can be calculated directly, without involving any specific machine-learning model; it effectively suppresses low-confidence outliers and enhances the signal-to-noise ratio. Future applications of this method will focus on de novo ligand discovery from cell-based DEL selections.
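The abstract does not state the likelihood or prior behind the MAP enrichment, so the following is only a minimal sketch of the general idea under assumptions that are ours, not the authors': the read count `k` of a compound is taken to be Poisson-distributed with mean `r * expected`, where `expected` is the count anticipated with no enrichment, and the enrichment `r` is given a Gamma(`alpha`, `beta`) prior. The function names `map_enrichment` and `neg_log_posterior` and the default hyperparameters are hypothetical.

```python
import math


def map_enrichment(k, expected, alpha=1.0, beta=1.0):
    """Posterior mode of the enrichment r (illustrative assumption only).

    Assumes k ~ Poisson(r * expected) and r ~ Gamma(alpha, rate=beta).
    The posterior is then Gamma(alpha + k, rate=beta + expected), whose
    mode is (alpha + k - 1) / (beta + expected), clipped at zero.
    """
    return max(alpha + k - 1.0, 0.0) / (beta + expected)


def neg_log_posterior(log_r, k, expected, alpha=1.0, beta=1.0):
    """MAP-style loss for a model that predicts log enrichment.

    Sum of the Poisson negative log-likelihood and the negative log of
    the Gamma prior (constant terms of the prior are dropped).
    Requires expected > 0.
    """
    r = math.exp(log_r)
    mu = r * expected
    poisson_nll = mu - k * math.log(mu) + math.lgamma(k + 1)
    neg_log_prior = beta * r - (alpha - 1.0) * log_r
    return poisson_nll + neg_log_prior


if __name__ == "__main__":
    # Two compounds with the same naive enrichment k / expected = 30,
    # but very different amounts of supporting evidence.
    for k, expected in [(3, 0.1), (300, 10.0)]:
        print(k, expected, k / expected, map_enrichment(k, expected))
```

With the default hyperparameters, the sparsely sampled compound (3 reads against 0.1 expected) is pulled down to roughly 2.7, while the well-sampled one (300 reads against 10 expected) stays near 27.3, even though both have a naive ratio of 30. This is the kind of suppression of low-confidence outliers that the abstract attributes to a regularized enrichment metric. Minimizing `neg_log_posterior` over `log_r` for a single compound recovers the same value as `map_enrichment`, which is why the same quantity can serve either as a training loss or as a directly computed metric.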