
Data splitting for artificial neural networks using SOM-based stratified sampling.

Affiliation

Research and Development, United Water, Adelaide, SA 5001, Australia.

Publication information

Neural Netw. 2010 Mar;23(2):283-94. doi: 10.1016/j.neunet.2009.11.009. Epub 2009 Nov 26.

Abstract

Data splitting is an important consideration during artificial neural network (ANN) development, where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since the hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets that are obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated in comparison to random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling to minimize the statistical differences between datasets. Of these approaches, DUPLEX is found to provide benchmark performance, yielding good model performance with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially in the case of non-uniform datasets, with the benefit of scalability to perform data splitting on large datasets.
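For readers who want to experiment with the general idea, below is a minimal Python/NumPy sketch of SOM-based stratified sampling with Neyman allocation as the abstract describes it at a high level: a small SOM is trained on the data, each sample is assigned to its best-matching unit (its stratum), and the hold-out set is allocated across strata in proportion to stratum size and within-stratum standard deviation. The grid size, SOM training schedule, helper names (train_som, som_strata, neyman_split), the choice to cluster on inputs and output jointly, and the toy dataset are illustrative assumptions, not the authors' implementation or recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)


def train_som(X, grid=(4, 4), iters=2000, lr0=0.5, sigma0=1.5):
    """Train a small rectangular SOM on X (n_samples x n_features)."""
    rows, cols = grid
    W = rng.normal(size=(rows * cols, X.shape[1]))        # codebook vectors
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(iters):
        x = X[rng.integers(len(X))]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))       # best-matching unit
        frac = t / iters                                  # decay learning rate and radius
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3
        h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)                    # pull BMU neighbourhood towards x
    return W


def som_strata(X, W):
    """Assign each sample to its best-matching SOM unit (its stratum)."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def neyman_split(X, y, n_test, grid=(4, 4)):
    """Hold out roughly n_test samples, Neyman-allocated across SOM strata."""
    Z = np.column_stack([X, y])          # assumption: cluster on inputs and output jointly
    strata = som_strata(Z, train_som(Z, grid))
    labels, counts = np.unique(strata, return_counts=True)
    stds = np.array([y[strata == h].std() + 1e-12 for h in labels])
    alloc = counts * stds                # Neyman allocation: n_h proportional to N_h * s_h
    alloc = np.round(n_test * alloc / alloc.sum()).astype(int)   # rounding may shift total slightly
    test_idx = []
    for h, n_h in zip(labels, alloc):
        members = np.flatnonzero(strata == h)
        test_idx.extend(rng.choice(members, size=min(n_h, len(members)), replace=False))
    test_idx = np.array(sorted(test_idx))
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    return train_idx, test_idx


# Toy usage: split a 1-D function-approximation dataset roughly 80/20.
X = rng.uniform(-3.0, 3.0, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
train_idx, test_idx = neyman_split(X, y, n_test=100)
print(len(train_idx), len(test_idx))
```

The same allocation step can be repeated on the remaining data to carve out a validation subset; choosing the SOM grid size and allocation scheme is exactly where the paper's guidelines would apply.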

