Liu Xiaojian, Zhu Weimin, Ding Xiaohan, Fang Yi, Wang Shengfan, Zhu Lin, Shen Hong-Bin, Pan Xiaoyong
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China.
Nucleic Acids Res. 2025 Jul 19;53(14). doi: 10.1093/nar/gkaf748.
RNA-binding proteins play crucial roles in many RNA-associated biological processes that are closely linked to cellular function and disease. Existing deep learning methods predict protein-RNA interactions from CLIP-seq data. However, CLIP-seq signal depends on gene expression, which varies significantly across cells. Existing methods are typically trained on peak-associated binding sites and implicitly defined non-binding sites, without considering cell-specific expression profiles. Given the dynamic nature of protein-RNA interactions, these methods struggle to accurately predict the bound nucleotides and the binding strength of proteins on RNAs across cell lines. This study therefore proposes iDeepB, a novel deep learning-based method designed to predict protein binding profiles on RNAs at base resolution by integrating cell-line-specific gene expression profiles. iDeepB first constructs expression-aware benchmark datasets from cell-specific RNA-seq and eCLIP-seq data. These datasets are used to train a hybrid deep network with multi-head attention, enabling the prediction of protein binding profiles, analysis of binding motif syntax, and quantification of the functional effects of disease-related genome mutations. Comprehensive evaluation on the newly developed benchmark datasets demonstrates that iDeepB outperforms existing methods in predicting protein binding profiles on RNAs.
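The expression-dependence argument in the abstract can be illustrated with a small sketch. One reading of it is that zero CLIP coverage on a transcript that is barely expressed in a given cell line is not evidence of non-binding. The code below is not iDeepB's actual procedure, which the abstract does not specify. The function name `expression_aware_labels`, the `min_tpm` threshold, and the choice to divide per-base counts by TPM are all hypothetical and used for illustration only.

```python
from typing import List, Optional


def expression_aware_labels(
    clip_counts: List[int],
    tpm: float,
    min_tpm: float = 1.0,  # hypothetical cutoff, not taken from the paper
) -> Optional[List[float]]:
    """Turn per-base eCLIP read counts into expression-aware targets.

    Returns None when the transcript is expressed below `min_tpm`, so that
    its bases are treated as "unknown" instead of implicit non-binding sites.
    Otherwise returns the per-base counts scaled by transcript expression
    (an assumed normalization, chosen only to illustrate the idea).
    """
    if tpm < min_tpm:
        return None
    return [count / tpm for count in clip_counts]


# An expressed transcript yields a per-base target profile.
expressed = expression_aware_labels([0, 4, 8], tpm=2.0)

# A transcript that is nearly silent in this cell line yields no labels.
silent = expression_aware_labels([0, 0, 0], tpm=0.1)
```

Under these assumptions, the same zero-coverage region is a usable negative in a cell line where the transcript is expressed, and is excluded in a cell line where it is not.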