Predictive Modeling of Gene Expression and Localization of DNA Binding Site Using Deep Convolutional Neural Networks.

Author information

Karshenas Arman, Röschinger Tom, Garcia Hernan G

Affiliations

Biophysics Graduate Group, University of California at Berkeley, Berkeley, CA, USA.

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.

Publication information

bioRxiv. 2024 Dec 20:2024.12.17.629042. doi: 10.1101/2024.12.17.629042.

Abstract

Despite the sequencing revolution, large swaths of the genomes sequenced to date lack any information about the arrangement of transcription factor binding sites on regulatory DNA. Massively Parallel Reporter Assays (MPRAs) have the potential to dramatically accelerate our genomic annotations by making it possible to measure the gene expression levels driven by thousands of mutational variants of a regulatory region. However, the interpretation of such data often assumes that each base pair in a regulatory sequence contributes independently to gene expression. To enable the analysis of these data in a manner that accounts for possible correlations between distant bases along a regulatory sequence, we developed the Deep learning Adaptable Regulatory Sequence Identifier (DARSI). This convolutional neural network leverages MPRA data to predict gene expression levels directly from raw regulatory DNA sequences. By harnessing this predictive capacity, DARSI systematically identifies transcription factor binding sites within regulatory regions at single-base-pair resolution. To validate its predictions, we benchmarked DARSI against curated databases, confirming its accuracy in predicting transcription factor binding sites. Additionally, DARSI predicted novel unmapped binding sites, paving the way for future experimental efforts to confirm the existence of these binding sites and to identify the transcription factors that target those sites. Thus, by automating and improving the annotation of regulatory regions, DARSI generates experimentally actionable predictions that can feed iterations of the theory-experiment cycle aimed at reaching a predictive understanding of transcriptional control.
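As a rough illustration of the general idea in the abstract (a convolutional model scoring raw DNA, then using that score to localize important bases), the following is a minimal pure-Python sketch. It is not DARSI's architecture or training procedure, which the abstract does not specify. The single hand-written kernel, the function names (`one_hot`, `conv1d`, `predict`, `mutation_effects`), the toy sequence, and the use of in silico mutagenesis as the localization step are all assumptions made for illustration only.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 floats in A, C, G, T order."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def conv1d(x, kernel):
    """Slide one kernel of shape (width, 4) along the sequence (no padding)."""
    w = len(kernel)
    return [
        sum(x[i + j][c] * kernel[j][c] for j in range(w) for c in range(4))
        for i in range(len(x) - w + 1)
    ]

def predict(seq, kernel):
    """Toy 'expression' score: ReLU on each window, then global max pooling."""
    return max(max(0.0, a) for a in conv1d(one_hot(seq), kernel))

def mutation_effects(seq, kernel):
    """Per position, the largest score drop over the three possible substitutions."""
    base_score = predict(seq, kernel)
    effects = []
    for i, ref in enumerate(seq):
        drops = [
            base_score - predict(seq[:i] + alt + seq[i + 1:], kernel)
            for alt in BASES
            if alt != ref
        ]
        effects.append(max(drops))
    return effects

# Hypothetical kernel: +1 for each base matching "TATA", -1 otherwise.
# A trained network would learn many such kernels from MPRA data instead.
motif = "TATA"
kernel = [[1.0 if b == base else -1.0 for base in BASES] for b in motif]

seq = "GGCCTATAGGCC"  # made-up sequence with the motif at positions 4-7
effects = mutation_effects(seq, kernel)
```

In this toy case only substitutions inside the embedded motif lower the score, so the nonzero entries of `effects` mark the candidate site at single-base resolution. A single short kernel cannot capture correlations between distant bases; stacking several convolutional layers is one common way to widen the receptive field, though whether DARSI does so is not stated here.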

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/01c5/11702772/0d58025f0964/nihpp-2024.12.17.629042v1-f0001.jpg
