Accurate predictions on small data with a tabular foundation model.

Author information

Hollmann Noah, Müller Samuel, Purucker Lennart, Krishnakumar Arjun, Körfer Max, Hoo Shi Bin, Schirrmeister Robin Tibor, Hutter Frank

Affiliations

Machine Learning Lab, University of Freiburg, Freiburg, Germany.

Computational Medicine, Berlin Institute of Health at Charité, Universitätsmedizin Berlin, Berlin, Germany.

Publication information

Nature. 2025 Jan;637(8045):319-326. doi: 10.1038/s41586-024-08328-6. Epub 2025 Jan 8.

Abstract

Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science. The fundamental prediction task of filling in missing values of a label column based on the rest of the columns is essential for various applications as diverse as biomedical risk models, drug discovery and materials science. Although deep learning has revolutionized learning from raw data and led to numerous high-profile success stories, gradient-boosted decision trees have dominated tabular data for the past 20 years. Here we present the Tabular Prior-data Fitted Network (TabPFN), a tabular foundation model that outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time. In 2.8 s, TabPFN outperforms an ensemble of the strongest baselines tuned for 4 h in a classification setting. As a generative transformer-based foundation model, this model also allows fine-tuning, data generation, density estimation and learning reusable embeddings. TabPFN is a learning algorithm that is itself learned across millions of synthetic datasets, demonstrating the power of this approach for algorithm development. By improving modelling abilities across diverse fields, TabPFN has the potential to accelerate scientific discovery and enhance important decision-making in various domains.
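
To make the prediction task named in the abstract concrete, the sketch below fills in the missing entries of a label column from the remaining columns. It is not TabPFN and does not reproduce any result from the paper: the nearest-neighbour rule, the toy table, and the names `fill_missing_labels` and `rows` are assumptions chosen only for illustration.

```python
import math


def fill_missing_labels(rows, label_key="label"):
    """Return a new list of rows in which missing labels (None) are filled in.

    Each missing label is copied from the nearest labelled row, using
    Euclidean distance over the remaining numeric columns.
    This is a stand-in predictor for illustration, not TabPFN.
    """
    labelled = [r for r in rows if r[label_key] is not None]
    if not labelled:
        raise ValueError("need at least one labelled row")
    feature_keys = [k for k in rows[0] if k != label_key]

    def distance(a, b):
        return math.sqrt(sum((a[k] - b[k]) ** 2 for k in feature_keys))

    filled = []
    for row in rows:
        if row[label_key] is None:
            nearest = min(labelled, key=lambda other: distance(row, other))
            # Build a new dict so the input table is left unchanged.
            row = {**row, label_key: nearest[label_key]}
        filled.append(row)
    return filled


# Hypothetical toy table: two feature columns and one label column.
rows = [
    {"age": 30.0, "dose": 1.0, "label": "low"},
    {"age": 62.0, "dose": 4.0, "label": "high"},
    {"age": 33.0, "dose": 1.5, "label": None},
    {"age": 58.0, "dose": 3.5, "label": None},
]
completed = fill_missing_labels(rows)
print([r["label"] for r in completed])
```

In this framing, any tabular learner (gradient-boosted trees, or a foundation model such as the one the abstract describes) plays the role that the nearest-neighbour rule plays here.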


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fc6/11711098/389a70d27529/41586_2024_8328_Fig1_HTML.jpg
