
Data splitting for artificial neural networks using SOM-based stratified sampling.

Affiliation

Research and Development, United Water, Adelaide, SA 5001, Australia.

Publication information

Neural Netw. 2010 Mar;23(2):283-94. doi: 10.1016/j.neunet.2009.11.009. Epub 2009 Nov 26.

Abstract

Data splitting is an important consideration during artificial neural network (ANN) development, where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since the hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets that are obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated in comparison to random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling to minimize the statistical differences between datasets. Of these approaches, DUPLEX is found to provide benchmark performance, yielding good model performance with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially in the case of non-uniform datasets, with the benefit of scalability to perform data splitting on large datasets.
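For readers who want to experiment with the general idea, below is a minimal Python/NumPy sketch of SOM-based stratified sampling with Neyman allocation as the abstract describes it at a high level: a small SOM is trained on the data, each sample is assigned to its best-matching unit (its stratum), and the hold-out set is allocated across strata in proportion to stratum size and within-stratum standard deviation. The grid size, SOM training schedule, helper names (train_som, som_strata, neyman_split), the choice to cluster on inputs and output jointly, and the toy dataset are illustrative assumptions, not the authors' implementation or recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)


def train_som(X, grid=(4, 4), iters=2000, lr0=0.5, sigma0=1.5):
    """Train a small rectangular SOM on X (n_samples x n_features)."""
    rows, cols = grid
    W = rng.normal(size=(rows * cols, X.shape[1]))        # codebook vectors
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(iters):
        x = X[rng.integers(len(X))]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))       # best-matching unit
        frac = t / iters                                  # decay learning rate and radius
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3
        h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)                    # pull BMU neighbourhood towards x
    return W


def som_strata(X, W):
    """Assign each sample to its best-matching SOM unit (its stratum)."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def neyman_split(X, y, n_test, grid=(4, 4)):
    """Hold out roughly n_test samples, Neyman-allocated across SOM strata."""
    Z = np.column_stack([X, y])          # assumption: cluster on inputs and output jointly
    strata = som_strata(Z, train_som(Z, grid))
    labels, counts = np.unique(strata, return_counts=True)
    stds = np.array([y[strata == h].std() + 1e-12 for h in labels])
    alloc = counts * stds                # Neyman allocation: n_h proportional to N_h * s_h
    alloc = np.round(n_test * alloc / alloc.sum()).astype(int)   # rounding may shift total slightly
    test_idx = []
    for h, n_h in zip(labels, alloc):
        members = np.flatnonzero(strata == h)
        test_idx.extend(rng.choice(members, size=min(n_h, len(members)), replace=False))
    test_idx = np.array(sorted(test_idx))
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    return train_idx, test_idx


# Toy usage: split a 1-D function-approximation dataset roughly 80/20.
X = rng.uniform(-3.0, 3.0, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
train_idx, test_idx = neyman_split(X, y, n_test=100)
print(len(train_idx), len(test_idx))
```

The same allocation step can be repeated on the remaining data to carve out a validation subset; choosing the SOM grid size and allocation scheme is exactly where the paper's guidelines would apply.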

