Engineering Research Center of Internet of Things Technology Applications (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, Wuxi, 214122, Jiangsu, China.
School of Medicine and Pharmaceuticals, Jiangnan University, Wuxi, 214122, Jiangsu, China.
Mol Genet Genomics. 2018 Aug;293(4):1035-1049. doi: 10.1007/s00438-018-1436-3. Epub 2018 Mar 29.
DNase I hypersensitive sites (DHSs) are hallmarks of chromatin regions that contain transcriptional regulatory elements, which makes them critical for understanding the regulatory mechanisms of gene expression. Although high-throughput techniques have identified large numbers of DHSs in plant genomes, the experimentally determined DHSs cover only a fraction of plant species and cellular processes. Furthermore, these experimental methods are both time-consuming and expensive. Automated computational methods that predict DHSs in plant genomes efficiently and accurately are therefore urgently needed. Several prediction methods have been proposed recently, but all of them take a long time to build the model, which makes them unsuitable for very large datasets. In the present work, we propose pDHS-ELM, a new ensemble model based on the extreme learning machine (ELM), which predicts DHSs in plant genomes by fusing two different modes of pseudo-nucleotide composition. Two kinds of features, reverse complement k-mer and pseudo-nucleotide composition, were used to represent the DHSs. ELMs served as the base classifiers, and an ensemble framework combined their outputs. When applied to DHSs in the Arabidopsis thaliana and rice (Oryza sativa) genomes, the proposed method achieved accuracies of up to 88.48% and 87.58%, respectively. Compared with state-of-the-art techniques, pDHS-ELM achieved higher sensitivity, specificity, and Matthews correlation coefficient with much less training and test time. Using pDHS-ELM, we identified 42,370 and 103,979 DHSs in the A. thaliana and rice genomes, respectively. The predicted DHSs were depleted of bulk nucleosomes and were tightly associated with transcription factors: approximately 90% of the predicted DHSs overlapped with transcription factors. We also demonstrated that the predicted DHSs were associated with DNA methylation, nucleosome positioning/occupancy, and histone modification. These results suggest that pDHS-ELM is a promising and powerful new tool for the analysis of transcriptional regulatory elements. The pDHS-ELM tool is available on GitHub: https://github.com/shanxinzhang/pDHS-ELM/
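The pipeline summarized in the abstract (sequence features, ELM base classifiers, an ensemble over their outputs) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes NumPy, covers only the reverse complement k-mer features (pseudo-nucleotide composition is omitted because it needs physicochemical property tables), uses a sigmoid hidden layer, and combines base classifiers by simple score averaging. The abstract does not state the combination rule, so the averaging, the function and class names, and all parameter values are hypothetical.

```python
from itertools import product

import numpy as np

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmers(k):
    """Sorted k-mers, one representative per reverse-complement pair."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted({min(km, reverse_complement(km)) for km in kmers})


def revc_kmer_features(seq, k=2):
    """Normalized counts of k-mers, merging each k-mer with its reverse complement."""
    index = {km: i for i, km in enumerate(canonical_kmers(k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        canon = min(km, reverse_complement(km))
        if canon in index:  # windows containing N or other symbols are skipped
            counts[index[canon]] += 1
    total = counts.sum()
    return counts / total if total else counts


class ELM:
    """Single-hidden-layer ELM: random fixed hidden weights, least-squares output weights."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of a random projection (assumed activation choice).
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Output weights come from the Moore-Penrose pseudoinverse; no iterative training.
        self.beta = np.linalg.pinv(self._hidden(X)) @ y
        return self

    def decision(self, X):
        return self._hidden(X) @ self.beta


def ensemble_scores(models, feature_sets):
    """Average the scores of base ELMs, each paired with its own feature matrix."""
    return np.mean([m.decision(X) for m, X in zip(models, feature_sets)], axis=0)


if __name__ == "__main__":
    # Toy sequences and labels (+1 = DHS, -1 = non-DHS), invented for illustration only.
    seqs = ["ACGTACGTGG", "TTTTAAAATT", "GCGCGCATGC", "ATATATTTAA"]
    y = np.array([1.0, -1.0, 1.0, -1.0])
    feats = [np.array([revc_kmer_features(s, k) for s in seqs]) for k in (2, 3)]
    models = [ELM(n_hidden=20, seed=i).fit(X, y) for i, X in enumerate(feats)]
    print(np.sign(ensemble_scores(models, feats)))
```

Because the only fitted parameters are the output weights, and these come from one pseudoinverse rather than from iterative optimization, training an ELM is typically fast, which fits the abstract's emphasis on reduced training time.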