Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China.
Center for Data Science, Zhejiang University, Hangzhou, China.
PLoS Comput Biol. 2021 Feb 18;17(2):e1008767. doi: 10.1371/journal.pcbi.1008767. eCollection 2021 Feb.
N6-methyladenine (6mA) is an important DNA modification associated with a wide range of biological processes. Accurately identifying 6mA sites on a genomic scale is crucial for understanding the biological functions of 6mA. However, existing experimental techniques for detecting 6mA sites are not cost-effective, so new computational methods are needed for this problem. In this paper, we developed a deep learning framework named Deep6mA to identify DNA 6mA sites. It requires neither prior knowledge of 6mA nor manually crafted sequence features, and its performance is superior to that of other DNA 6mA prediction tools. Specifically, 5-fold cross-validation on a rice benchmark dataset gives Deep6mA a sensitivity of 92.96% and a specificity of 95.06%, with an overall prediction accuracy of 94%. Importantly, we find that sequences containing 6mA sites share similar patterns across species. The model trained on rice data predicts the 6mA sites of three other species well (Arabidopsis thaliana, Fragaria vesca and Rosa chinensis), with a prediction accuracy over 90%. In addition, we find that (1) 6mA tends to occur at GAGG motifs, which suggests that the sequence near the 6mA site may be conserved; and (2) 6mA is enriched in the TATA box of the promoter, which may be the main way it regulates downstream gene expression.
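For readers unfamiliar with the reported metrics, the following minimal sketch shows how sensitivity, specificity, and accuracy are computed from the counts of a binary classifier. The function name and the counts are hypothetical and chosen only to reproduce the reported percentages; they are not taken from the paper. The sketch also assumes equal numbers of positive and negative samples, which the abstract does not state; under that assumption, accuracy equals the mean of sensitivity and specificity, (92.96% + 95.06%) / 2 = 94.01%, which is consistent with the reported 94%.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # fraction of true 6mA sites recovered
    specificity = tn / (tn + fp)                 # fraction of non-6mA sites correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # fraction of all predictions that are correct
    return sensitivity, specificity, accuracy


# Hypothetical counts (illustration only): 10,000 positives and 10,000 negatives.
sn, sp, acc = classification_metrics(tp=9296, fn=704, tn=9506, fp=494)
print(f"sensitivity={sn:.2%} specificity={sp:.2%} accuracy={acc:.2%}")
```

With unbalanced classes the accuracy would instead be a weighted average of the two rates, so the three numbers should be read together.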