Yao Zhou, Zhang Wenjing, Song Peng, Hu Yuxue, Liu Jianxiao
Key Laboratory of Smart Farming for Agricultural Animals, Huazhong Agricultural University, Wuhan 430070, China.
Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China.
Brief Bioinform. 2023 Mar 19;24(2). doi: 10.1093/bib/bbad095.
Accurately identifying the function of DNA sequences is an essential and challenging task in genomics. Deep learning has been widely used for the functional analysis of DNA sequences, with models including DeepSEA, DanQ, DeepATT and TBiNet. However, these methods have high computational complexity and do not fully account for distant interactions among chromatin features, which limits their prediction accuracy. In this work, we propose DeepFormer, a hybrid deep neural network model for DNA sequence function prediction based on a convolutional neural network (CNN) and a flow-attention mechanism. In DeepFormer, the CNN captures local features of DNA sequences, including important motifs. Based on the conservation law of flow networks, the flow-attention mechanism captures more distal interactions among sequence features with linear time complexity. We compare DeepFormer with the four classical methods above on the commonly used dataset of 919 chromatin features for nearly 4.9 million noncoding DNA sequences. Experimental results show that DeepFormer significantly outperforms all four methods, with an average recall at least 7.058% higher. Furthermore, we confirmed the effectiveness of DeepFormer in capturing functional variation using examples from Alzheimer's disease, pathogenic mutations in alpha-thalassemia and modification of CCCTC-binding factor (CTCF) activity. We further predicted maize chromatin accessibility in five tissues to validate the generalization of DeepFormer; its average recall exceeds that of the classical methods by at least 1.54%, demonstrating strong robustness.
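To illustrate why an attention mechanism can run in linear time, the sketch below compares a quadratic attention computation with an algebraically equivalent linear one. It is not DeepFormer's flow-attention: it is a generic kernelized linear attention, and it omits the flow-conservation reweighting that the abstract refers to. The sigmoid feature map `phi`, the function names and the tiny matrix sizes are all assumptions chosen for illustration.

```python
import math
import random

def phi(x):
    """Non-negative feature map (sigmoid; an illustrative assumption)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def quadratic_attention(Q, K, V):
    """Reference version: forms all n*n pairwise weights, so cost is O(n^2)."""
    dim_v = len(V[0])
    out = []
    for q in Q:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(dim_v)])
    return out

def linear_attention(Q, K, V):
    """Same result, but keys and values are summarised once, so cost is O(n)."""
    d, dim_v = len(K[0]), len(V[0])
    kv = [[0.0] * dim_v for _ in range(d)]  # sum over j of phi(k_j) v_j^T
    k_sum = [0.0] * d                       # sum over j of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for r in range(d):
            k_sum[r] += fk[r]
            for c in range(dim_v):
                kv[r][c] += fk[r] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        norm = sum(a * b for a, b in zip(fq, k_sum))
        out.append([sum(fq[r] * kv[r][c] for r in range(d)) / norm
                    for c in range(dim_v)])
    return out

def random_matrix(rows, cols, rng):
    """Hypothetical helper: random values standing in for sequence features."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d = 6, 4  # n sequence positions, d feature channels
Q, K, V = (random_matrix(n, d, rng) for _ in range(3))
slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(slow, fast) for a, b in zip(ra, rb))
print("outputs agree:", max_diff < 1e-9)
```

The two functions differ only in the order of multiplication: the linear version accumulates the key-value summary once and reuses it for every query, so its cost grows with the sequence length rather than with its square.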