Maddouri Omar, Qian Xiaoning, Alexander Francis J, Dougherty Edward R, Yoon Byung-Jun
Department of Electrical and Computer Engineering, Texas A&M University, College Station TX 77843, USA.
Computational Science Initiative, Brookhaven National Laboratory, Upton NY 11973, USA.
Data Brief. 2022 Apr 2;42:108113. doi: 10.1016/j.dib.2022.108113. eCollection 2022 Jun.
Transfer learning (TL) techniques can enable effective learning in data-scarce domains by repurposing data or scientific knowledge available in relevant source domains for predictive tasks in a target domain of interest. In this Data in Brief article, we present a synthetic dataset for binary classification in the context of Bayesian transfer learning, which can be used to design and evaluate TL-based classifiers. To this end, we consider numerous combinations of classification settings and use them to simulate a diverse set of feature-label distributions of varying learning complexity. For each set of model parameters, we provide a pair of target and source datasets that have been jointly sampled from the underlying feature-label distributions in the target and source domains, respectively. In both domains, the data in a given class are normally distributed, and the distributions across domains are related to each other through a joint prior. To keep the classification complexity consistent across the provided datasets, we have controlled the Bayes error so that it stays within a predefined range of values that mimic realistic classification scenarios across different relatedness levels. The provided datasets may serve as useful resources for designing and benchmarking transfer learning schemes for binary classification, as well as for estimating classification error.
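The sampling scheme described above can be illustrated with a minimal sketch. It is not the authors' generation procedure: the abstract does not specify the joint prior, so the sketch assumes a simple stand-in in which the class means of the two domains are drawn jointly with correlation `rho`, and both classes share an identity covariance. Under those assumptions the Bayes error of two equally likely Gaussian classes has the closed form Φ(-Δ/2), where Δ is the Mahalanobis distance between the class means, so the error can be controlled by rejecting parameter draws that fall outside a chosen range. All names and default values (`make_pair`, `rho`, `err_range`, the sample sizes) are hypothetical.

```python
import math
import numpy as np


def bayes_error_equal_cov(mu0, mu1, cov):
    """Bayes error for two equally likely Gaussian classes sharing one covariance."""
    diff = mu1 - mu0
    # Mahalanobis distance between the two class means
    delta = math.sqrt(diff @ np.linalg.solve(cov, diff))
    # Phi(-delta / 2), written with the error function
    return 0.5 * (1.0 + math.erf(-delta / (2.0 * math.sqrt(2.0))))


def sample_domain(rng, mu0, mu1, cov, n_per_class):
    """Draw n_per_class Gaussian points per class; return features and labels."""
    x0 = rng.multivariate_normal(mu0, cov, size=n_per_class)
    x1 = rng.multivariate_normal(mu1, cov, size=n_per_class)
    return np.vstack([x0, x1]), np.repeat([0, 1], n_per_class)


def make_pair(seed=0, dim=2, rho=0.8, n_target=10, n_source=200,
              err_range=(0.1, 0.3), max_tries=1000):
    """Sample one (target, source) dataset pair with a controlled target Bayes error.

    ASSUMPTION: the "joint prior" is a stand-in that correlates the class
    means across domains with coefficient rho; it is not the prior used
    to build the published dataset.
    """
    rng = np.random.default_rng(seed)
    cov = np.eye(dim)  # ASSUMPTION: shared identity covariance in both domains
    for _ in range(max_tries):
        means_t = rng.normal(size=(2, dim))  # target class means
        noise = rng.normal(size=(2, dim))
        # Source means correlated with target means; rho sets the relatedness
        means_s = rho * means_t + math.sqrt(1.0 - rho ** 2) * noise
        err_t = bayes_error_equal_cov(means_t[0], means_t[1], cov)
        # Rejection step: keep only draws whose target Bayes error is in range
        if err_range[0] <= err_t <= err_range[1]:
            target = sample_domain(rng, means_t[0], means_t[1], cov, n_target)
            source = sample_domain(rng, means_s[0], means_s[1], cov, n_source)
            return target, source, err_t
    raise RuntimeError("no parameter draw met the Bayes error range")


if __name__ == "__main__":
    (Xt, yt), (Xs, ys), err = make_pair()
    print(Xt.shape, Xs.shape, round(err, 3))
```

In this sketch, `rho` close to 1 makes the source class means nearly identical to the target ones, and `rho` close to 0 makes the two domains unrelated. For simplicity the rejection step constrains only the target-domain Bayes error; the source domain is left unconstrained.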