Friedman Ryan Z, Ramu Avinash, Lichtarge Sara, Wu Yawei, Tripp Lloyd, Lyon Daniel, Myers Connie A, Granas David M, Gause Maria, Corbo Joseph C, Cohen Barak A, White Michael A
The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA.
Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO 63110, USA.
Cell Syst. 2025 Jan 15;16(1):101163. doi: 10.1016/j.cels.2024.12.004. Epub 2025 Jan 7.
Deep learning is a promising strategy for modeling cis-regulatory elements. However, models trained on genomic sequences often fail to explain why the same transcription factor can activate or repress transcription in different contexts. To address this limitation, we developed an active learning approach to train models that distinguish between enhancers and silencers composed of binding sites for the photoreceptor transcription factor cone-rod homeobox (CRX). After training an initial model on nearly all CRX-bound sites in the genome, we coupled synthetic biology with uncertainty sampling to generate additional rounds of informative training data. This allowed us to iteratively train models on data from multiple rounds of massively parallel reporter assays. The ability of the resulting models to discriminate between CRX sites with identical sequence but opposite functions establishes active learning as an effective strategy to train models of regulatory DNA. A record of this paper's transparent peer review process is included in the supplemental information.
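The uncertainty sampling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which uncertainty measure, model architecture, or class labels the authors used, so everything below is an assumption for illustration: the entropy-based score, the three-class layout (for example silencer, inactive, enhancer), the function names `predictive_entropy` and `select_most_uncertain`, and the made-up probabilities. The idea is only that candidate sequences whose predicted class probabilities are closest to uniform are the ones selected for the next round of measurement.

```python
import numpy as np


def predictive_entropy(probs):
    """Shannon entropy (in bits) of each row of class probabilities."""
    # Clip to avoid log2(0); rows are assumed to sum to 1.
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=1)


def select_most_uncertain(probs, n_select):
    """Return indices of the n_select candidates with the highest entropy."""
    entropy = predictive_entropy(probs)
    # Sort in descending order of entropy; stable sort keeps ties in input order.
    order = np.argsort(-entropy, kind="stable")
    return order[:n_select]


# Hypothetical model predictions for five candidate sequences over three
# assumed classes. These numbers are invented for illustration only.
candidate_probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    [0.10, 0.80, 0.10],
    [0.50, 0.49, 0.01],  # torn between two classes
    [0.05, 0.05, 0.90],
])

chosen = select_most_uncertain(candidate_probs, n_select=2)
print(chosen)
```

In an active learning loop of the kind the abstract describes, the selected candidates would be synthesized and measured, the new labels added to the training set, and the model retrained before the next round of selection.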