MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction.

Affiliations

Faculty of Electrical and Computer Engineering, Tarbiat Modares University, Tehran P.O. Box 14115-143, Iran.

Institute for Research in Fundamental Sciences (IPM), School of Computer Science, Tehran P.O. Box 14115-143, Iran.

Publication information

Genes (Basel). 2019 Nov 25;10(12):969. doi: 10.3390/genes10120969.

Abstract

Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have reported an association between changes in DNAm and human age. The high-dimensional nature of this type of data significantly increases the execution time of modeling algorithms. To mitigate this problem, we propose a two-stage parallel algorithm for the selection of age-related CpG sites. The algorithm first clusters the data into similar age ranges. In the next stage, a parallel genetic algorithm (PGA) based on the MapReduce paradigm (MR-based PGA) selects the age-related features of each individual age range. In the proposed method, the execution of the algorithm for each age range (data parallel), the evaluation of chromosomes (task parallel), and the calculation of the fitness function (data parallel) are performed using a novel parallel framework. In this paper, we consider 16 different healthy DNAm datasets from human blood tissue that contain the relevant age information. These datasets are combined into a single set, which is then randomly divided into train and test sets at a ratio of 7:3. We build a Gradient Boosting Regressor (GBR) model on the CpG sites selected from the train set. To evaluate the model accuracy, we compare our results with state-of-the-art approaches that used these datasets, and observe that our method performs better on the unseen test set, with a Mean Absolute Deviation (MAD) of 3.62 years and a correlation (R) of 95.96% between age and DNAm. On the train data, the MAD and R are 1.27 years and 99.27%, respectively. Finally, we evaluate the effect of parallelization on computation time. The algorithm without parallelization requires 4123 min to complete, whereas the parallelized execution on 3 computing machines with 32 processing cores each takes a total of only 58 min, roughly a 71-fold reduction. This shows that our proposed algorithm is both efficient and scalable.
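The second stage described above (map each chromosome to a fitness score, then reduce to the fittest candidates) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data, the names `make_toy_data`, `fitness`, `map_phase`, `reduce_phase`, and `evolve`, and the correlation-based fitness with a per-feature penalty (the abstract does not state the actual fitness function). Threads stand in for cluster workers, so this shows the structure of the computation rather than any real speedup, and the first-stage age-range clustering is omitted.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def make_toy_data(n_samples=60, n_features=12, seed=0):
    """Synthetic stand-in for a methylation matrix; only features 0 and 1 track age."""
    rng = random.Random(seed)
    ages = [rng.uniform(20, 80) for _ in range(n_samples)]
    rows = []
    for age in ages:
        row = [rng.random() for _ in range(n_features)]
        row[0] = age / 100 + rng.gauss(0, 0.02)      # rises with age
        row[1] = 1 - age / 100 + rng.gauss(0, 0.02)  # falls with age
        rows.append(row)
    return rows, ages


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def fitness(chromosome, rows, ages, penalty=0.5):
    """Hypothetical fitness: reward age-correlated features, penalise every selected one."""
    score = 0.0
    for j, bit in enumerate(chromosome):
        if bit:
            column = [row[j] for row in rows]
            score += abs(pearson(column, ages)) - penalty
    return score


def map_phase(population, rows, ages, workers=4):
    """Map step: score every chromosome independently (task parallel)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda c: fitness(c, rows, ages), population))
    return list(zip(population, scores))


def reduce_phase(scored, keep):
    """Reduce step: gather all (chromosome, score) pairs and keep the fittest."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [chromosome for chromosome, _ in ranked[:keep]]


def evolve(rows, ages, pop_size=20, generations=30, mutation_rate=0.05, seed=1):
    """Plain generational GA over feature bitmasks using the map and reduce steps."""
    rng = random.Random(seed)
    n = len(rows[0])
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = reduce_phase(map_phase(population, rows, ages), keep=pop_size // 2)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)                 # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children               # parents survive unchanged
    return reduce_phase(map_phase(population, rows, ages), keep=1)[0]
```

Calling `rows, ages = make_toy_data()` followed by `evolve(rows, ages)` returns a 0/1 mask over the toy features. In a real MapReduce deployment the map step would be distributed across machines, and the per-chromosome fitness could itself be split over data partitions, which is the data-parallel level the abstract mentions.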

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6abc/6947642/9cf2c417a29e/genes-10-00969-g001.jpg
