Leveraging mixed-effects regression trees for the analysis of high-dimensional longitudinal data to identify the low and high-risk subgroups: simulation study with application to genetic study.

Authors

Jahangiri Mina, Kazemnejad Anoshirvan, Goldfeld Keith S, Daneshpour Maryam S, Momen Mehdi, Mostafaei Shayan, Khalili Davood, Akbarzadeh Mahdi

Affiliations

Department of Biostatistics, Faculty of Medical Sciences, Tarbiat Modares University, Tehran, Iran.

Division of Biostatistics, Department of Population Health, NYU Grossman School of Medicine, New York, NY, USA.

Publication information

BioData Min. 2025 Mar 19;18(1):22. doi: 10.1186/s13040-025-00437-w.

DOI:10.1186/s13040-025-00437-w

PMID:40108712

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11924713/

Abstract

BACKGROUND

The linear mixed-effects model (LME) is a conventional parametric method widely used to analyze longitudinal and clustered data in genetic studies. Previous studies have shown that this model can be sensitive to its parametric assumptions and can predict less accurately than non-parametric methods such as the random effects-expectation maximization (RE-EM) and unbiased RE-EM regression tree algorithms. These longitudinal regression trees use classification and regression trees (CART) and conditional inference trees (Ctree), respectively, to estimate the fixed-effects component of the mixed-effects model. CART is a well-known tree algorithm, but it is greedy: each split is chosen locally, without regard to the splits that follow. To mitigate this issue, we used the Evtree algorithm to estimate the fixed-effects part of the LME for longitudinal and clustered data in genome association studies.
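The alternation that RE-EM-type algorithms rely on can be illustrated with a minimal Python sketch. It assumes a random-intercept model, y_it = f(x_it) + b_i + e_it, and alternates between fitting a tree to the response with the current intercepts subtracted and re-estimating the intercepts from the tree residuals. Several pieces are simplifying assumptions and not the authors' implementation: `fit_stump` is a hypothetical single-split stand-in for CART, Ctree, or Evtree; `ratio` is an assumed known ratio of residual variance to random-intercept variance, whereas the published algorithms re-estimate the variance components by fitting an LME at each iteration; and all function names are illustrative.

```python
from statistics import mean


def fit_stump(x, y):
    """Fit a one-split regression tree on a single numeric predictor.

    Hypothetical stand-in for the tree learner (CART, Ctree, or Evtree).
    Returns a function mapping a predictor value to a fitted mean.
    """
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds; the largest value cannot split
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        ml, mr = mean(left), mean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # predictor is constant: fall back to the overall mean
        m = mean(y)
        return lambda xi: m
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr


def re_em_sketch(x, y, subject, ratio=1.0, n_iter=20):
    """Alternate between a tree fit and random-intercept updates.

    ratio is the assumed (fixed) residual-to-intercept variance ratio;
    it controls how strongly each subject's intercept is shrunk to zero.
    """
    subjects = sorted(set(subject))
    b = {s: 0.0 for s in subjects}  # start with all random intercepts at zero
    tree = None
    for _ in range(n_iter):
        # Step 1: remove the current random effects, then fit the tree.
        adjusted = [yi - b[s] for yi, s in zip(y, subject)]
        tree = fit_stump(x, adjusted)
        # Step 2: shrunken mean of each subject's residuals from the tree,
        # i.e. n_i / (n_i + ratio) times the subject's mean residual.
        for s in subjects:
            resid = [yi - tree(xi) for xi, yi, si in zip(x, y, subject) if si == s]
            b[s] = sum(resid) / (len(resid) + ratio)
    return tree, b
```

Calling `re_em_sketch(x, y, subject)` with three equal-length lists returns the fitted stump as a callable and a dictionary of estimated intercepts keyed by subject. A fixed iteration count is used here for brevity; a convergence check on the change in the intercepts would be the more usual stopping rule.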

METHODS

In this study, we propose a new non-parametric algorithm for longitudinal data, called "Ev-RE-EM", which models a continuous response variable by using the Evtree algorithm to estimate the fixed-effects part of the LME. We compared its predictive performance with that of other tree algorithms, such as RE-EM and unbiased RE-EM, with and without a structure for the autocorrelation between within-subject errors, when analyzing longitudinal data in a genetic study. The autocorrelation structures considered were a first-order autoregressive process, a compound symmetric structure with a constant correlation, and a general correlation matrix. The real data came from the longitudinal Tehran Cardiometabolic Genetic Study (TCGS). The models used body mass index (BMI) as the phenotype and included age, sex, and 25,640 single nucleotide polymorphisms (SNPs) as predictor variables.
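To make the three within-subject error structures concrete, the following Python sketch builds each correlation matrix for one subject with n equally spaced visits. The function names and the dictionary format used for the general matrix are hypothetical, chosen only for this illustration. In R's nlme package, comparable structures are available as `corAR1`, `corCompSymm`, and `corSymm`; the abstract does not state which software the study used.

```python
def ar1_corr(n, rho):
    """First-order autoregressive: correlation decays as rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def compound_symmetry_corr(n, rho):
    """Compound symmetry: every pair of visits shares the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def general_corr(n, pairwise):
    """General (unstructured) matrix: one free correlation per pair of visits.

    pairwise maps (i, j) with i < j to a correlation, so it must hold
    n * (n - 1) / 2 entries. Positive definiteness is NOT checked here.
    """
    R = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            R[i][j] = R[j][i] = pairwise[(i, j)]
    return R
```

The parameter counts show the trade-off between the structures: the autoregressive and compound symmetric forms each use a single correlation parameter regardless of the number of visits, while the general matrix needs n(n-1)/2 parameters.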

RESULTS

The predictive performance of Ev-RE-EM and unbiased RE-EM was very similar. In addition, the Ev-RE-EM algorithm generated smaller trees than the unbiased RE-EM algorithm, which makes the trees easier to interpret.

CONCLUSION

The results showed that the unbiased RE-EM and Ev-RE-EM algorithms outperformed the RE-EM algorithm. Since algorithm performance varies across datasets, researchers should test different algorithms on the dataset of interest and select the best-performing one. Accurately predicting and diagnosing an individual's genetic profile is crucial in medical studies. The model with the highest accuracy should be used to enhance understanding of the genetics of complex traits, improve disease prevention and diagnosis, and aid in treating complex human diseases.
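One way to compare candidate algorithms on a dataset of interest is subject-level cross-validation, sketched below in Python. This is an assumption about how such a comparison could be set up, not the evaluation protocol of the study. All names (`rmse`, `subject_folds`, `compare_models`) are hypothetical, and each entry of `fitters` is assumed to be a function that takes training predictors and responses and returns a prediction function.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def subject_folds(subject, k):
    """Assign whole subjects to k folds.

    Keeping all of a subject's repeated measures in one fold prevents the
    same individual from appearing in both training and test data.
    Assumes there are at least k distinct subjects.
    """
    ids = sorted(set(subject))
    fold_of = {s: i % k for i, s in enumerate(ids)}
    return [fold_of[s] for s in subject]


def compare_models(fitters, x, y, subject, k=3):
    """Return the name of the model with the lowest mean held-out RMSE,
    together with the mean RMSE of every candidate."""
    folds = subject_folds(subject, k)
    scores = {}
    for name, fit in fitters.items():
        errors = []
        for f in range(k):
            train = [i for i, g in enumerate(folds) if g != f]
            test = [i for i, g in enumerate(folds) if g == f]
            predict = fit([x[i] for i in train], [y[i] for i in train])
            errors.append(rmse([y[i] for i in test], [predict(x[i]) for i in test]))
        scores[name] = sum(errors) / k
    best = min(scores, key=scores.get)
    return best, scores
```

Because the held-out subjects are unseen, predictions in this sketch rely on the fixed-effects part only; evaluating predictions for future visits of already observed subjects would call for a different split.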

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6939/11924713/806fff79f91f/13040_2025_437_Figa_HTML.jpg
