Predictive biomarkers for embryotoxicity: a machine learning approach to mitigating multicollinearity in RNA-Seq.

Affiliations

Developmental and Reproductive Toxicology Research Group, Korea Institute of Toxicology, Daejeon, 34114, Republic of Korea.

Institute for Advanced Studies, Universiti Malaya, 50603, Kuala Lumpur, Malaysia.

Publication information

Arch Toxicol. 2024 Dec;98(12):4093-4105. doi: 10.1007/s00204-024-03852-w. Epub 2024 Sep 6.

Abstract

Multicollinearity, characterized by significant co-expression patterns among genes, often occurs in high-throughput expression data and can undermine the reliability of predictive models. This study examined multicollinearity among closely related genes, specifically in RNA-Seq data obtained from embryoid bodies (EBs) exposed to 5-fluorouracil perturbation, to identify genes associated with embryotoxicity. Six genes (Dppa5a, Gdf3, Zfp42, Meis1, Hoxa2, and Hoxb1) emerged as candidates based on domain knowledge and were validated using qPCR in EBs perturbed by 39 test substances. We conducted correlation studies and used the variance inflation factor (VIF) to test for multicollinearity among the genes. Recursive feature elimination with cross-validation (RFECV) ranked Zfp42 and Hoxb1 as the top two among the seven features considered, identifying them as potential biomarkers for early embryotoxicity assessment. A t-test assessing the statistical significance of this two-feature prediction model yielded a p value of 0.0044, confirming the successful reduction of redundancy and multicollinearity through RFECV. Our study presents a systematic methodology for applying machine learning techniques to transcriptomics data analysis, enhancing the discovery of potential reporter gene candidates for embryotoxicity screening research and improving the predictive model's accuracy and feasibility while reducing financial and time constraints.
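To make the VIF step concrete, the following is a minimal sketch of how a variance inflation factor can be computed, not the authors' pipeline. It regresses each feature on all the others and reports VIF_j = 1 / (1 - R_j^2). The function name `variance_inflation_factors`, the synthetic "genes" (`gene_a`, `gene_b`, `gene_c`), and the sample size are illustrative assumptions and do not come from the study's data.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X.

    R_j^2 is the coefficient of determination obtained by regressing
    column j on all remaining columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix: intercept column followed by the other features.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((y - y.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic expression values (assumption: purely illustrative data).
rng = np.random.default_rng(0)
n = 200
gene_a = rng.normal(size=n)
gene_b = gene_a + 0.1 * rng.normal(size=n)  # strongly co-expressed with gene_a
gene_c = rng.normal(size=n)                 # generated independently of the others

X = np.column_stack([gene_a, gene_b, gene_c])
vifs = variance_inflation_factors(X)

for name, vif in zip(["gene_a", "gene_b", "gene_c"], vifs):
    print(f"{name}: VIF = {vif:.2f}")
```

In this toy setup the two co-expressed features receive large VIFs while the independent one stays close to 1. A commonly cited rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity; the abstract does not state which threshold the study applied.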
