Li Gang, Zrimec Jan, Ji Boyang, Geng Jun, Larsbrink Johan, Zelezniak Aleksej, Nielsen Jens, Engqvist Martin Km
Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden.
Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kgs. Lyngby, Denmark.
Bioinform Biol Insights. 2021 Jun 27;15:11779322211020315. doi: 10.1177/11779322211020315. eCollection 2021.
A challenge in developing machine learning regression models is that it is difficult to know whether maximal performance has been reached on the test dataset, or whether further model improvement is possible. In biology, this problem is particularly pronounced because sample labels (response variables) are typically obtained through experiments and therefore carry experimental noise. Such label noise places a fundamental limit on the performance metrics attainable by regression models on the test dataset.
We address this challenge by deriving an expected upper bound for the coefficient of determination (R²) for regression models when tested on the holdout dataset. This upper bound depends only on the noise associated with the response variable in a dataset and on the variance of that response variable. The upper bound estimate was validated via Monte Carlo simulations and then used as a tool to bootstrap performance of regression models trained on biological datasets, including protein sequence data, transcriptomic data, and genomic data.
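The idea of a noise-limited R² can be illustrated with a small Monte Carlo sketch. The abstract does not state the formula, so the code below assumes a common noise-ceiling form, 1 - (mean label-noise variance) / (variance of the observed labels), under additive, independent, zero-mean noise; the derivation in the paper may differ in detail. The function names (`r2_upper_bound`, `r2_score`, `simulate`) and all parameter values are hypothetical and chosen only for illustration. The simulation compares the assumed bound with the R² achieved by an "oracle" model that predicts the noise-free labels exactly.

```python
import numpy as np


def r2_upper_bound(y_obs, noise_var):
    """Assumed noise ceiling: 1 - mean(noise variance) / var(observed labels)."""
    y_obs = np.asarray(y_obs, dtype=float)
    return 1.0 - np.mean(noise_var) / np.var(y_obs, ddof=1)


def r2_score(y_true, y_pred):
    """Coefficient of determination of predictions y_pred against labels y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def simulate(n_samples=2000, signal_sd=2.0, noise_sd=1.0, n_trials=200, seed=0):
    """Average the assumed bound and the oracle R² over repeated synthetic datasets."""
    rng = np.random.default_rng(seed)
    bounds, oracle_scores = [], []
    for _ in range(n_trials):
        # Noise-free labels, which a real experiment never observes directly.
        y_clean = rng.normal(0.0, signal_sd, n_samples)
        # Observed labels = noise-free labels + experimental noise.
        y_obs = y_clean + rng.normal(0.0, noise_sd, n_samples)
        bounds.append(r2_upper_bound(y_obs, noise_sd ** 2))
        # Oracle model: predicts the noise-free label, scored on noisy labels.
        oracle_scores.append(r2_score(y_obs, y_clean))
    return float(np.mean(bounds)), float(np.mean(oracle_scores))


bound, oracle = simulate()
print(f"assumed upper bound: {bound:.3f}, oracle R2: {oracle:.3f}")
```

With these illustrative settings the signal variance is 4 and the noise variance is 1, so both quantities should lie close to 1 - 1/5 = 0.8. Under these assumptions, a model scoring well above that value on noisy holdout labels would be fitting noise rather than signal.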
The new method for estimating upper bounds for model performance on test data should aid researchers in developing ML regression models that reach their maximum potential. Although we study biological datasets in this work, the new upper bound estimates will hold true for regression models from any research field or application area where response variables have associated noise.