Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, 112 Paterson St, New Brunswick, NJ, 08901, USA.
Department of Biomedical and Health Informatics, UMKC School of Medicine, 2411 Holmes Street, Kansas City, MO, 64108, USA.
Sci Rep. 2024 Nov 3;14(1):26503. doi: 10.1038/s41598-024-78553-6.
Cardiovascular diseases (CVDs) are complex, multifactorial conditions that require personalized assessment and treatment. Advancements in multi-omics technologies, namely RNA sequencing and whole-genome sequencing, have provided translational researchers with a comprehensive view of the human genome. The efficient synthesis and analysis of these data through an integrated approach that characterizes genetic variants alongside expression patterns linked to emerging phenotypes can reveal novel biomarkers and enable the segmentation of patient populations based on personalized risk factors. In this study, we present a methodology that integrates traditional bioinformatics, classical statistics, and multimodal machine learning techniques. Our approach has the potential to uncover the intricate mechanisms underlying CVD, enabling patient-specific risk and response profiling. We sourced transcriptomic expression data and single nucleotide polymorphisms (SNPs) from both CVD patients and healthy controls. By integrating these multi-omics datasets with clinical demographic information, we generated patient-specific profiles. Utilizing a robust feature selection approach, we identified a signature of 27 transcriptomic features and SNPs that are effective predictors of CVD. Differential expression analysis, combined with minimum redundancy maximum relevance feature selection, highlighted biomarkers that explain the disease phenotype. This approach prioritizes both biological relevance and efficiency in machine learning. We employed Combined Annotation Dependent Depletion (CADD) scores and allele frequencies to identify variants with pathogenic characteristics in CVD patients. Classification models trained on this signature demonstrated high-accuracy predictions for CVD. The best-performing model was an XGBoost classifier optimized via Bayesian hyperparameter tuning, which correctly classified all patients in our test dataset.
Using SHapley Additive exPlanations, we created risk assessments for patients, offering further contextualization of these predictions in a clinical setting. Across the cohort, RPL36AP37 and HBA1 were scored as the most important biomarkers for predicting CVDs. A comprehensive literature review revealed that a substantial portion of the diagnostic biomarkers identified have previously been associated with CVD. The framework we propose in this study is unbiased and generalizable to other diseases and disorders.
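To illustrate the minimum redundancy maximum relevance (mRMR) step mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes a greedy selection in which relevance and redundancy are both approximated by absolute Pearson correlation (mRMR is often formulated with mutual information instead); the function names, feature names, and toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def mrmr_select(features, labels, k):
    """Greedily pick up to k feature names.

    features: dict mapping feature name -> list of per-sample values
    labels:   list of 0/1 class labels (assumed: 1 = case, 0 = control)
    Score = relevance to the label minus mean redundancy with features
    already selected (difference form of mRMR, correlation as a proxy).
    """
    relevance = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: six samples, three candidate features.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "gene_a": [1, 1, 2, 5, 6, 6],        # strongly tracks the label
    "gene_a_like": [1, 2, 2, 5, 5, 6],   # relevant but nearly redundant with gene_a
    "gene_b": [0, 1, 0, 1, 0, 1],        # weaker signal, largely independent of gene_a
}
selected = mrmr_select(features, labels, k=2)
print(selected)
```

In this toy example the near-duplicate feature is penalized for its redundancy with the first pick, so the second slot goes to the weaker but more independent feature. That trade-off is the point of mRMR: a compact signature that avoids spending features on repeated information.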