IEEE/ACM Trans Comput Biol Bioinform. 2022 May-Jun;19(3):1393-1402. doi: 10.1109/TCBB.2021.3084596. Epub 2022 Jun 3.
Aging is traditionally thought to be caused by complex, interacting factors such as DNA methylation. Traditional formulas for DNA methylation age are based on linear models, and little work has explored the effectiveness of neural networks, which can learn non-linear relationships. DNA methylation data typically has a feature space of hundreds of thousands of features but far fewer biological samples, which leads to overfitting and poor generalization in neural networks. We propose the Correlation Pre-Filtered Neural Network (CPFNN), which uses Spearman correlation to pre-filter the input features before feeding them into a neural network. We compare CPFNN with statistical regressions (i.e., Horvath's and Hannum's formulas), neural networks with LASSO regularization and elastic net regularization, and dropout neural networks. CPFNN outperforms these models by at least 1 year in terms of Mean Absolute Error (MAE), achieving an MAE of 2.7 years. We also test for association between epigenetic age and both schizophrenia and Down syndrome (p=0.024 and , respectively). We find that for a large number of candidate features, such as genome-wide DNA methylation data, a key factor in improving prediction accuracy is to appropriately weight features that are highly correlated with the outcome of interest.
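The pre-filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ranks`, `spearman`, `prefilter`), the `top_k` cutoff, and the synthetic data are all assumptions made for illustration, and the abstract does not state how many features CPFNN retains or how the network is configured. The sketch only shows the general idea of ranking features by absolute Spearman correlation with age and keeping the strongest ones before any network is trained.

```python
import math
import random


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def prefilter(X, y, top_k):
    """Return sorted indices of the top_k columns of X by |Spearman rho| with y.

    top_k is a hypothetical cutoff chosen for this sketch.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        scores.append(abs(spearman(column, y)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:top_k])


# Synthetic stand-in for methylation data (assumption): many more
# features than samples, with three columns that track age.
random.seed(0)
n_samples, n_features = 40, 200
ages = [random.uniform(20, 80) for _ in range(n_samples)]
X = [[random.random() for _ in range(n_features)] for _ in range(n_samples)]
informative = [3, 57, 120]
for i, age in enumerate(ages):
    for j in informative:
        X[i][j] = age / 100 + random.gauss(0, 0.02)

selected = prefilter(X, ages, top_k=3)
# Only the selected columns would be passed on to the neural network.
X_filtered = [[row[j] for j in selected] for row in X]
```

In this sketch the filtered matrix `X_filtered` is what a downstream regressor would see; the network itself is left out because the abstract gives no architectural details. Filtering on the outcome should be done inside each training fold so that held-out samples do not influence which features are kept.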