Institute of Information Technologies, Mathematics and Mechanics, Lobachevsky State University, 603022 Nizhny Novgorod, Russia.
IRCCS Istituto delle Scienze Neurologiche di Bologna, 40139 Bologna, Italy.
Gigascience. 2022 Oct 19;11. doi: 10.1093/gigascience/giac097.
DNA methylation has a significant effect on gene expression and can be associated with various diseases. Meta-analysis of available DNA methylation datasets requires development of a specific workflow for joint data processing.
We propose a comprehensive approach to combining DNA methylation datasets in order to classify controls and patients. The solution includes data harmonization, construction of machine learning classification models, dimensionality reduction of models, imputation of missing values, and explanation of model predictions by explainable artificial intelligence (XAI) algorithms. We show that harmonization can improve classification accuracy by up to 20% when the training and test datasets were preprocessed with different methods. The best accuracy was obtained with tree ensembles, exceeding 95% for Parkinson's disease. Dimensionality reduction can substantially decrease the number of features without detriment to classification accuracy. The best imputation methods achieve almost the same classification accuracy on data with missing values as on the original data. XAI approaches allowed us to explain model predictions from both population-level and individual perspectives.
We propose a methodologically valid and comprehensive approach to the classification of healthy individuals and patients with various diseases based on whole-blood DNA methylation data, using Parkinson's disease and schizophrenia as examples. The proposed algorithm works better for the former pathology (Parkinson's disease), which is characterized by a complex set of symptoms. The approach makes it possible to solve data harmonization problems in the meta-analysis of many different datasets, to impute missing values, and to build classification models of small dimensionality.
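The imputation, tree-ensemble classification, and dimensionality-reduction steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic stand-ins for methylation beta values, and the choice of scikit-learn's `KNNImputer`, `RandomForestClassifier`, and importance-based feature selection, as well as every variable name and parameter value, is an assumption made purely for illustration. Harmonization and XAI steps are omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for beta values: 200 samples x 50 CpG sites (assumed sizes).
n_samples, n_cpgs, n_informative = 200, 50, 5
y = rng.integers(0, 2, size=n_samples)  # 0 = control, 1 = patient
X = rng.uniform(0.1, 0.6, size=(n_samples, n_cpgs))
X[:, :n_informative] += 0.3 * y[:, None]  # shift a few CpGs in patients

# Remove about 5% of the values at random to mimic missing probes.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.3, random_state=0, stratify=y
)

# Imputation: fit on the training set only, then apply to the test set.
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# Tree-ensemble classifier trained on all features.
full_model = RandomForestClassifier(n_estimators=200, random_state=0)
full_model.fit(X_train_imp, y_train)
full_acc = accuracy_score(y_test, full_model.predict(X_test_imp))

# Dimensionality reduction: keep the 10 most important features and retrain.
top = np.argsort(full_model.feature_importances_)[::-1][:10]
small_model = RandomForestClassifier(n_estimators=200, random_state=0)
small_model.fit(X_train_imp[:, top], y_train)
small_acc = accuracy_score(y_test, small_model.predict(X_test_imp[:, top]))

print(f"accuracy with {n_cpgs} features: {full_acc:.3f}")
print(f"accuracy with {len(top)} features: {small_acc:.3f}")
```

Fitting the imputer on the training split alone keeps test-set information out of the model, which matters when, as in the abstract, training and test data come from different datasets.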