Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea.
College of Information Technology in the United Arab Emirates University (UAEU), Abu Dhabi 15551, UAE.
Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad474.
The investigation of DNA methylation can shed light on the processes underlying human well-being and help determine overall human health. However, insufficient coverage makes it challenging to implement single-stranded DNA methylation sequencing technologies, highlighting the need for an efficient prediction model. Such models are needed both to understand the underlying biological systems and to predict single-cell methylation data accurately.
In this study, we developed positional features for predicting CpG site methylation. Positional characteristics of the sequence are derived from CpG region data and the distances between nearby CpG sites. Multiple optimized classifiers and several ensemble learning approaches are evaluated, with the OPTUNA framework used to optimize the algorithms. The CatBoost algorithm, followed by the stacking algorithm, outperformed existing DNA methylation identifiers.
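As a rough illustration of what a distance-based positional feature might look like, the sketch below locates CpG dinucleotides in a sequence and, for a target site, returns the distances to its nearest upstream and downstream CpG neighbours. It is not the iCpG-Pos feature definition: the function names `cpg_positions` and `positional_features`, the neighbour count `k`, and the padding value `-1` are all assumptions made for this example.

```python
from bisect import bisect_left
from typing import List


def cpg_positions(seq: str) -> List[int]:
    """Return the 0-based index of the C in every CpG dinucleotide."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i] == "C" and s[i + 1] == "G"]


def positional_features(sites: List[int], target: int,
                        k: int = 2, pad: int = -1) -> List[int]:
    """Distances from `target` to its k nearest upstream and k nearest
    downstream CpG sites (hypothetical feature layout, not the paper's).

    `sites` must be sorted in ascending order. Missing neighbours are
    filled with `pad` so the feature vector always has length 2 * k.
    """
    idx = bisect_left(sites, target)

    # Upstream neighbours, nearest first.
    upstream = sites[max(0, idx - k):idx][::-1]

    # Skip the target itself if it is one of the listed sites.
    start = idx + 1 if idx < len(sites) and sites[idx] == target else idx
    downstream = sites[start:start + k]

    up = [target - p for p in upstream]
    down = [p - target for p in downstream]
    up += [pad] * (k - len(up))
    down += [pad] * (k - len(down))
    return up + down


if __name__ == "__main__":
    sites = cpg_positions("ACGTTCGACGCG")   # CpG sites at 1, 5, 8, 10
    print(positional_features(sites, 5))    # [4, -1, 3, 5]
```

Feature vectors of this kind could then be passed to any tabular classifier; fixed-length padding is one simple way to handle sites near the ends of a region.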
The data and methodologies used in this study are openly accessible to the research community. Researchers can access the positional features and algorithms used for predicting CpG site methylation patterns. The proposed iCpG-Pos approach uses only positional features, which substantially reduces computational complexity compared with other known approaches for detecting CpG site methylation patterns. In conclusion, by focusing on positional features, iCpG-Pos offers both accuracy and efficiency, making it a promising tool for advancing DNA methylation research and its applications in human health and well-being.