Enhancing selection of alcohol consumption-associated genes by random forest.

Author information

Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.

Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA 02118, USA.

Publication information

Br J Nutr. 2024 Jun 28;131(12):2058-2067. doi: 10.1017/S0007114524000795. Epub 2024 Apr 12.

DOI:10.1017/S0007114524000795

PMID:38606596

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11216877/

Abstract

Machine learning methods have been used to identify omics markers for a variety of phenotypes. We aimed to examine whether a supervised machine learning algorithm can improve identification of alcohol-associated transcriptomic markers. In this study, we analysed array-based, whole-blood-derived expression data for 17 873 gene transcripts in 5508 Framingham Heart Study participants. Using the Boruta algorithm, a supervised random forest (RF)-based feature selection method, we selected twenty-five alcohol-associated transcripts. In a testing set (30 % of all study participants), the AUC (area under the receiver operating characteristic curve) values of these twenty-five transcripts were 0·73, 0·69 and 0·66 for non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers and moderate drinkers v. heavy drinkers, respectively. The AUC of the transcripts selected by the Boruta method were comparable to those of transcripts identified using conventional linear regression models; for example, the AUC of 1958 transcripts identified by conventional linear regression models (false discovery rate < 0·2) were 0·74, 0·66 and 0·65, respectively. With Bonferroni correction for the twenty-five Boruta method-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed that thirteen transcripts were associated with obesity, three transcripts with type 2 diabetes and one transcript with hypertension. For example, we observed that alcohol consumption was inversely associated with the expression of […], […] and […], that […] and […] were positively associated with obesity, and that […] was inversely associated with hypertension. In conclusion, using a supervised machine learning method, the RF-based Boruta algorithm, we identified novel alcohol-associated gene transcripts.
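Two quantities in the abstract can be checked by hand: the Bonferroni threshold for twenty-five transcripts tested against three CVD risk factors, and the AUC used to compare drinking categories. The sketch below is illustrative only and is not the authors' code. The family-wise alpha of 0.05 is an assumption (the abstract reports only the resulting threshold), the function names are hypothetical, and the scores and labels are made-up toy values rather than study data.

```python
def bonferroni_threshold(alpha, n_transcripts, n_outcomes):
    """Per-test significance threshold for n_transcripts x n_outcomes tests."""
    return alpha / (n_transcripts * n_outcomes)


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Assumed alpha = 0.05; 25 transcripts x 3 risk factors = 75 tests.
threshold = bonferroni_threshold(0.05, 25, 3)
print(f"{threshold:.1e}")  # 6.7e-04

# Toy predicted scores and binary labels (hypothetical, not study data).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
```

Under the assumed alpha, 0.05 / 75 rounds to the 6·7e-4 threshold quoted above. The pairwise AUC definition is quadratic in sample size, so it suits small illustrations; a rank-based implementation would be the usual choice for thousands of participants.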
