Hong Jianwei, Gao Ruitian, Yang Yang
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China.
School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China.
Bioinformatics. 2021 Oct 25;37(20):3436-3443. doi: 10.1093/bioinformatics/btab349.
Enhancers are important functional elements in genome sequences. Identifying enhancers is very challenging because enhancer sequences are highly diverse and their locations on the genome are flexible. The interactions between enhancers and genes are still not fully understood. To speed up studies of the regulatory roles of enhancers, computational tools for enhancer prediction have emerged in recent years. In particular, thanks to the ENCODE project and advances in high-throughput experimental techniques, a large number of experimentally verified enhancers have been annotated on the human genome, which allows large-scale prediction of unknown enhancers using data-driven methods. However, apart from human and some model organisms, validated enhancer annotations are scarce for most species, which makes computational identification of enhancers in their genomes more difficult.
In this study, we propose a deep learning-based predictor for enhancers, named CrepHAN, which features a hierarchical attention neural network and word embedding-based representations of DNA sequences. We train the model on experimentally supported data from the human genome and perform experiments on human and other mammals, including mouse, cow and dog. The experimental results show that CrepHAN is particularly strong at cross-species prediction and outperforms existing models by a large margin. In particular, for human-mouse cross-species prediction, the area under the receiver operating characteristic curve (AUC) increases by 0.033 to 0.145 on the combined tissue dataset and by 0.032 to 0.109 on tissue-specific datasets.
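As a rough illustration of what a word-based, hierarchical representation of DNA can look like, the sketch below splits a sequence into overlapping k-mer "words" and groups them into fixed-length "sentences" that a hierarchical attention model could attend over at two levels. This is a minimal sketch under stated assumptions, not the CrepHAN implementation: the abstract does not give the tokenization scheme, and the function names, `k`, `stride` and `words_per_sentence` are all hypothetical choices made here for illustration.

```python
from typing import Dict, List


def kmer_words(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mer 'words' (k and stride are assumed values)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def to_sentences(words: List[str], words_per_sentence: int = 4) -> List[List[str]]:
    """Group consecutive words into 'sentences'; the last one may be shorter."""
    return [
        words[i:i + words_per_sentence]
        for i in range(0, len(words), words_per_sentence)
    ]


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to each distinct word, in order of first appearance.

    Id 0 is reserved for padding, as is common for embedding lookups.
    """
    vocab: Dict[str, int] = {}
    for sentence in sentences:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentences: List[List[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Map words to ids; these ids would index a word-embedding table."""
    return [[vocab[word] for word in sentence] for sentence in sentences]


if __name__ == "__main__":
    words = kmer_words("ACGTACGGTA", k=3)
    sentences = to_sentences(words, words_per_sentence=4)
    vocab = build_vocab(sentences)
    print(encode(sentences, vocab))
```

In a setup like this, a word-level attention layer would summarize each sentence and a sentence-level attention layer would summarize the whole sequence; how CrepHAN actually defines words and sentences should be taken from the paper itself.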
bcmi.sjtu.edu.cn/~yangyang/CrepHAN.html.
Supplementary data are available at Bioinformatics online.