School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.
School of Chemical Engineering and Technology, Tianjin University, Tianjin, China.
BMC Genomics. 2019 Apr 23;20(1):306. doi: 10.1186/s12864-019-5654-9.
DNA methylation plays an important role in multiple biological processes that are closely related to human health. Studying DNA methylation can provide insight into the mechanisms underlying human health and can help in assessing health status. However, available sequencing technologies are limited by incomplete CpG coverage, so an efficient and convenient method for predicting the methylation states of CpG sites is needed. Previous studies on identifying the methylation states of CpG sites in single cells evaluated only sequence information or structural information.
In this paper, we propose a novel model, LightCpG, which combines positional features with sequence and structural features to provide information on the CpG sites at two stages. We used the LightGBM model to train the CpG site classifier, and further used sample extraction and merged features to reduce the training time. Our results indicate that the method performs well in recognizing DNA methylation states. The average AUC values on the 25 human hepatocellular carcinoma (HCC) cell datasets and the six human hepatoblastoma-derived (HepG2) cell datasets were 0.9616 and 0.9213, respectively. The average training times on the HCC and HepG2 datasets were 8.3 s and 5.06 s, respectively. Furthermore, the computational complexity of our model was much lower than that of other available methods for detecting the methylation states of CpG sites.
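As an illustration of how sequence and positional information about a CpG site might be merged into one feature vector, the following is a minimal sketch. It does not reproduce the paper's feature definitions, which this abstract does not specify: the function names, the one-hot window encoding, the nearest-neighbour encoding, and the placeholder values for missing neighbours are all assumptions made for the example, and structural features are omitted.

```python
from bisect import bisect_left

BASES = "ACGT"


def one_hot_window(genome, pos, flank=2):
    """Hypothetical sequence feature: one-hot encode the bases around a CpG.

    `pos` is the index of the C of the CpG. The window spans
    pos - flank .. pos + 1 + flank, so it always covers both C and G.
    Positions outside the sequence, or bases other than A/C/G/T,
    are encoded as four zeros.
    """
    features = []
    for i in range(pos - flank, pos + flank + 2):
        base = genome[i] if 0 <= i < len(genome) else "N"
        features.extend(1.0 if base == b else 0.0 for b in BASES)
    return features


def neighbour_features(observed, pos, k=1):
    """Hypothetical positional feature: state and distance of nearby CpGs.

    `observed` is a list of (position, state) pairs from the same cell,
    sorted by position, with state 1 = methylated and 0 = unmethylated.
    Returns, for the k nearest upstream and then the k nearest downstream
    observed sites, the pair (state, distance). A missing neighbour is
    encoded as (0.5, 0.0), an arbitrary placeholder chosen for this sketch.
    """
    positions = [p for p, _ in observed]
    idx = bisect_left(positions, pos)
    upstream = observed[:idx][::-1][:k]
    downstream = [site for site in observed[idx:] if site[0] != pos][:k]
    features = []
    for group in (upstream, downstream):
        for j in range(k):
            if j < len(group):
                p, state = group[j]
                features += [float(state), float(abs(p - pos))]
            else:
                features += [0.5, 0.0]
    return features


def merged_features(genome, observed, pos, flank=2, k=1):
    """Concatenate the two feature groups into a single vector."""
    return one_hot_window(genome, pos, flank) + neighbour_features(observed, pos, k)


if __name__ == "__main__":
    genome = "TTACGTACGGA"          # toy sequence with CpGs at indices 3 and 7
    observed = [(3, 1), (20, 0)]    # toy observed states in one cell
    print(merged_features(genome, observed, pos=7, flank=1, k=1))
```

Vectors of this kind, one per CpG site, could then be passed to any gradient-boosting classifier such as LightGBM; that training step is left out here so that the sketch depends only on the Python standard library.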
In summary, LightCpG is an accurate model for identifying the DNA methylation status of CpG sites in single cells. Furthermore, the three types of feature extraction methods and the two strategies used in LightCpG may be helpful for other prediction problems.
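For readers unfamiliar with the metric, the average AUC reported for the HCC and HepG2 datasets can be read as a per-cell ROC AUC averaged over cells. The sketch below shows one way such an average might be computed; it assumes one AUC per cell dataset followed by an unweighted mean, which the abstract does not state explicitly, and the function names and toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a methylated site (label 1)
    receives a higher score than an unmethylated site (label 0).
    Ties count as half. Quadratic in the number of sites, which is
    acceptable for a small illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one site of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(per_cell):
    """Unweighted mean of per-cell AUCs.

    `per_cell` is a list of (labels, scores) pairs, one per cell dataset.
    """
    return sum(roc_auc(labels, scores) for labels, scores in per_cell) / len(per_cell)


if __name__ == "__main__":
    # Toy data for two hypothetical cells, not taken from the paper.
    cells = [
        ([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]),
        ([1, 0], [0.8, 0.2]),
    ]
    print(mean_auc(cells))
```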