Predicting the Sequence Specificities of DNA-Binding Proteins by DNA Fine-Tuned Language Model With Decaying Learning Rates.

Author information

He Ying, Zhang Qinhu, Wang Siguo, Chen Zhanheng, Cui Zhen, Guo Zhen-Hao, Huang De-Shuang

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2023 Jan-Feb;20(1):616-624. doi: 10.1109/TCBB.2022.3165592. Epub 2023 Feb 3.

Abstract

DNA-binding proteins (DBPs) play vital roles in the regulation of biological systems. Although many deep learning methods already exist for predicting the sequence specificities of DBPs, they face two challenges. First, classic deep learning methods for DBP prediction usually fail to capture the dependencies between genomic sequences, because the one-hot codes they commonly use are mutually orthogonal. Second, these methods usually perform poorly when samples are scarce. To address both challenges, we developed a novel language model, named DNA Fine-tuned Language Model (DFLM), that mines DBPs using human genomic data and ChIP-seq datasets with decaying learning rates. It captures the dependencies between genome sequences from the context of human genomic data and then fine-tunes the features for DBP tasks using different ChIP-seq datasets. We first compared DFLM with existing widely used methods on 69 datasets, where it achieved excellent performance. We then conducted comparative experiments on complex DBPs and small datasets, and the results show that DFLM still achieved a significant improvement. Finally, through visualization analysis of one-hot encoding and DFLM, we found that one-hot encoding completely cuts off the dependencies within DNA sequences, whereas DFLM, by using a language model, represents these dependencies well. Source code is available at: https://github.com/Deep-Bioinfo/DFLM.
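The abstract's remark that one-hot codes are mutually orthogonal can be checked with a few lines of code. The sketch below is purely illustrative and is not taken from the DFLM repository; the names `one_hot`, `dot`, and `gram_matrix` are hypothetical helpers introduced here. It encodes each nucleotide as a 4-dimensional one-hot vector and computes all pairwise dot products. Every off-diagonal entry is zero, so the encoding by itself carries no notion of similarity or dependency between different bases.

```python
# Illustrative sketch only: not the authors' code. All names are hypothetical.
NUCLEOTIDES = "ACGT"


def one_hot(base):
    """Return the 4-dimensional one-hot vector for a single nucleotide."""
    return [1 if base == n else 0 for n in NUCLEOTIDES]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_matrix():
    """Pairwise dot products between the one-hot vectors of A, C, G, T."""
    vectors = [one_hot(b) for b in NUCLEOTIDES]
    return [[dot(u, v) for v in vectors] for u in vectors]


if __name__ == "__main__":
    # Each row has a single 1 on the diagonal and 0 elsewhere,
    # i.e. distinct bases are orthogonal to one another.
    for base, row in zip(NUCLEOTIDES, gram_matrix()):
        print(base, row)
```

A learned embedding, such as one produced by a language model, is not constrained in this way: vectors for different tokens can have non-zero dot products, which is what allows contextual relationships to be represented.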
