Department of Evolutionary & Environmental Biology, Faculty of Natural Sciences, University of Haifa, Haifa, Israel.
BMC Genomics. 2020 Apr 16;21(Suppl 2):257. doi: 10.1186/s12864-020-6606-0.
DNA methylation is widely used as a biomarker in crucial medical applications and for highly accurate prediction of human age. This biomarker is based on the methylation status of several hundred CpG sites. In a recent line of publications we adapted a versatile concept from evolutionary biology, the Universal Pacemaker (UPM), to the setting of epigenetic aging and denoted it the Epigenetic PaceMaker (EPM). Unlike other epigenetic clocks, the EPM is not confined to a specific pattern of aging, and the epigenetic age of each individual is inferred independently of other individuals. This allows explicit modeling of aging trends, in particular a nonlinear relationship between chronological and epigenetic age. In one of these recent works, we presented an algorithmic improvement based on a two-step conditional expectation maximization (CEM) algorithm that arrives at a critical point on the likelihood surface. The algorithm alternates between a time step and a site step while advancing on the likelihood surface.
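As a rough orientation, the alternation between the two steps can be written down under a linear methylation model. The notation here is an assumption, since the abstract does not define it: $s_{i,j}$ is the observed methylation of site $i$ in individual $j$, $s_i^0$ and $r_i$ are the site's starting level and rate, $t_j$ is the individual's epigenetic age, and the noise is taken to be i.i.d. Gaussian, so that maximizing the likelihood amounts to minimizing the residual sum of squares.

```latex
% Assumed model (illustrative notation, not taken from the abstract)
s_{i,j} = s_i^0 + r_i\, t_j + \varepsilon_{i,j},
\qquad \varepsilon_{i,j} \sim \mathcal{N}(0, \sigma^2)

\mathrm{RSS}(r, s^0, t) = \sum_{i=1}^{n} \sum_{j=1}^{m}
  \left( s_{i,j} - s_i^0 - r_i\, t_j \right)^2

% Site step: ages t held fixed
(r, s^0) \leftarrow \arg\min_{r,\, s^0} \mathrm{RSS}(r, s^0, t)

% Time step: site parameters held fixed
t \leftarrow \arg\min_{t} \mathrm{RSS}(r, s^0, t)
```

Because each step minimizes the same objective over one block of variables while the other is held fixed, the RSS cannot increase from one step to the next.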
Here we introduce nontrivial improvements to these steps that are essential for analyzing data sets of realistic magnitude in manageable time and space. These structural improvements are based on insights from linear algebra and symbolic algebra tools, which give us a greater understanding of the degeneracy of the complex problem space. This understanding, in turn, leads to the complete elimination of the bottleneck of cumbersome matrix multiplication and inversion, yielding a fast closed-form solution in both steps of the CEM. In the experimental results, we compare the CEM algorithm over several data sets and demonstrate the speedup obtained by the closed-form solutions. Our results support the theoretical analysis of this improvement.
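To make the idea of closed-form steps concrete, here is a minimal sketch in plain Python. It assumes the linear model s[i][j] = s0[i] + r[i] * t[j] + noise with i.i.d. Gaussian noise, which is not spelled out in this abstract. The function names (`site_step`, `time_step`, `rss`, `cem`) are hypothetical, and the formulas are the ordinary least-squares solutions implied by that assumed model, not necessarily the exact expressions derived in the paper. Neither step builds or inverts a matrix.

```python
def site_step(S, t):
    """With ages t fixed, fit each site's rate and start by simple least squares."""
    m = len(t)
    t_mean = sum(t) / m
    # Assumes the ages are not all identical, otherwise t_var is zero.
    t_var = sum((tj - t_mean) ** 2 for tj in t)
    rates, starts = [], []
    for row in S:  # one row of methylation values per site
        s_mean = sum(row) / m
        cov = sum((tj - t_mean) * (sij - s_mean) for tj, sij in zip(t, row))
        r = cov / t_var
        rates.append(r)
        starts.append(s_mean - r * t_mean)
    return rates, starts


def time_step(S, rates, starts):
    """With site parameters fixed, solve for each individual's age separately."""
    n, m = len(S), len(S[0])
    # Assumes at least one site has a nonzero rate.
    denom = sum(r * r for r in rates)
    return [
        sum(rates[i] * (S[i][j] - starts[i]) for i in range(n)) / denom
        for j in range(m)
    ]


def rss(S, rates, starts, t):
    """Residual sum of squares of the assumed linear model."""
    return sum(
        (S[i][j] - starts[i] - rates[i] * t[j]) ** 2
        for i in range(len(S))
        for j in range(len(t))
    )


def cem(S, t_init, max_iter=100, tol=1e-12):
    """Alternate site and time steps until the RSS stops improving."""
    t = list(t_init)
    prev = float("inf")
    for _ in range(max_iter):
        rates, starts = site_step(S, t)
        t = time_step(S, rates, starts)
        cur = rss(S, rates, starts, t)
        if prev - cur < tol:
            break
        prev = cur
    return rates, starts, t, cur
```

In this sketch a natural starting point for `t_init` is the chronological ages. Each site step costs time proportional to the number of matrix entries, and each time step likewise, which is the kind of saving over generic matrix inversion that the paragraph above describes.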
These improvements enable us to substantially increase the scale of inputs analyzed by the method, allowing us to apply the new approach to data sets that could not be analyzed before.