Tabular deep learning: a comparative study applied to multi-task genome-wide prediction.

Author information

Research Unit of Mathematical Sciences, University of Oulu, P.O. Box 8000, 90014, University of Oulu, Finland.

Publication information

BMC Bioinformatics. 2024 Oct 4;25(1):322. doi: 10.1186/s12859-024-05940-1.

Abstract

PURPOSE

More accurate prediction of phenotypic traits can increase the success of genomic selection in both plant and animal breeding studies and provide more reliable disease risk prediction in humans. Traditional approaches typically use regression models that assume a linear relationship between the genetic markers and the traits of interest. Non-linear models have been considered as an alternative tool for modeling genomic interactions (i.e. non-additive effects) and other subtle non-linear patterns between markers and phenotype. Deep learning has become a state-of-the-art non-linear prediction method for sound, image and language data. However, genomic data is better represented in a tabular format. The existing literature on deep learning for tabular data proposes a wide range of novel architectures and reports successful results on various datasets, yet tabular deep learning applications in genome-wide prediction (GWP) are still rare. In this work, we provide an overview of the main families of recent deep learning architectures for tabular data and apply them to multi-trait regression and multi-class classification for GWP on real gene datasets.
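To make the "traditional approach" concrete, the following is a minimal sketch of a linear multi-trait baseline on tabular marker data. It is not the study's pipeline: the simulated 0/1/2 genotype matrix, the ridge penalty `lam`, and the helper names `fit_multitrait_ridge` and `predict` are all assumptions introduced for illustration.

```python
import numpy as np

def fit_multitrait_ridge(X, Y, lam=1.0):
    """Closed-form ridge regression; one coefficient column per trait.

    X: (n_samples, n_markers) marker matrix, Y: (n_samples, n_traits).
    `lam` is a hypothetical penalty value, not taken from the paper.
    """
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean  # center markers so the intercept is not penalized
    Yc = Y - y_mean
    p = X.shape[1]
    # Solve (Xc'Xc + lam*I) B = Xc'Yc for all traits at once
    B = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ Yc)
    intercept = y_mean - x_mean @ B
    return B, intercept

def predict(X, B, intercept):
    """Additive (purely linear) prediction for every trait."""
    return X @ B + intercept

# Simulated data (an assumption): 200 individuals, 50 markers, 3 traits
rng = np.random.default_rng(0)
n, p, t = 200, 50, 3
X = rng.integers(0, 3, size=(n, p)).astype(float)  # 0/1/2 allele counts
causal = rng.random((p, 1)) < 0.2                   # roughly 20% causal markers
true_B = rng.normal(size=(p, t)) * causal
Y = X @ true_B + rng.normal(scale=0.5, size=(n, t))

# Fit on the first 150 individuals, predict the remaining 50
B, b0 = fit_multitrait_ridge(X[:150], Y[:150], lam=1.0)
pred = predict(X[150:], B, b0)

# Per-trait correlation between predicted and observed phenotypes
corr = [np.corrcoef(pred[:, k], Y[150:, k])[0, 1] for k in range(t)]
```

A model of this kind captures only additive marker effects, which is the limitation that motivates the non-linear tabular architectures compared in the study.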

METHODS

The study involves an extensive overview of recent deep learning architectures for tabular data learning: NODE, TabNet, TabR, TabTransformer, FT-Transformer, AutoInt, GANDALF, SAINT and LassoNet. These architectures are applied to multi-trait GWP. Comprehensive benchmarks of various tabular deep learning methods are conducted to identify best practices and determine their effectiveness compared to traditional methods.

RESULTS

Extensive experimental results on several genomic datasets (three for multi-trait regression and two for multi-class classification) highlight LassoNet as a standout performer, surpassing both the other tabular deep learning models and the highly efficient tree-based LightGBM method in terms of both prediction accuracy and computing efficiency.

CONCLUSION

Through a series of evaluations on real-world genomic datasets, the study identifies LassoNet as a standout performer, surpassing decision tree methods like LightGBM and other tabular deep learning architectures in terms of both predictive accuracy and computing efficiency. Moreover, the inherent variable selection property of LassoNet provides a systematic way to find important genetic markers that contribute to phenotype expression.
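For readers unfamiliar with why LassoNet selects variables, its objective can be sketched as follows. This formulation is taken from the general LassoNet literature rather than from this abstract, and the symbols (skip-layer weights \theta, first hidden-layer weights W^{(1)}, penalty \lambda, hierarchy parameter M, number of markers d) are notation assumed here for illustration.

```latex
\begin{aligned}
\min_{\theta,\, W} \quad & L(\theta, W) + \lambda \,\lVert \theta \rVert_1 \\
\text{subject to} \quad & \lVert W^{(1)}_j \rVert_\infty \le M \,\lvert \theta_j \rvert,
\qquad j = 1, \dots, d
\end{aligned}
```

Here \theta_j is the linear skip-connection weight of marker j and W^{(1)}_j collects the first hidden-layer weights attached to that marker. When the L1 penalty drives \theta_j to zero, the constraint forces W^{(1)}_j to zero as well, so marker j drops out of the network entirely. Under this reading, the markers with non-zero \theta_j are the selected ones.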

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5268/11452967/c7c03041388a/12859_2024_5940_Fig1_HTML.jpg
