Hawks Benjamin, Duarte Javier, Fraser Nicholas J, Pappalardo Alessandro, Tran Nhan, Umuroglu Yaman
Fermi National Accelerator Laboratory, Batavia, IL, United States.
University of California San Diego, La Jolla, CA, United States.
Front Artif Intell. 2021 Jul 9;4:676564. doi: 10.3389/frai.2021.676564. eCollection 2021.
Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similarly to or better than other neural architecture search techniques like Bayesian optimization in terms of computational efficiency. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.
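To make the two techniques concrete, the sketch below combines magnitude-based pruning with uniform fixed-point (fake) quantization on a single weight matrix. This is an illustrative assumption, not the paper's actual training pipeline; the function names (`magnitude_prune`, `fake_quantize`), the 50% sparsity target, and the 6-bit precision are hypothetical choices for demonstration.

```python
# Minimal sketch of quantization-aware pruning on one weight matrix,
# assuming magnitude-based pruning and uniform symmetric quantization.
# The paper's full quantization-aware training loop is not reproduced here.
import numpy as np

def magnitude_prune(w, sparsity):
    """Return a boolean mask that zeroes the smallest-magnitude weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return np.ones_like(w, dtype=bool)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.abs(w) > threshold

def fake_quantize(w, bits, scale=1.0):
    """Uniform symmetric quantization to `bits` bits of precision."""
    levels = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale * levels), -levels, levels) * scale / levels

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16))

mask = magnitude_prune(w, sparsity=0.5)   # prune insignificant synapses...
w_qap = fake_quantize(w * mask, bits=6)   # ...under reduced-precision arithmetic

print(f"sparsity: {1 - mask.mean():.2f}, distinct weight values: {np.unique(w_qap).size}")
```

In an actual quantization-aware pruning run, the mask and quantizer would be applied inside the training loop (with a straight-through gradient estimator for the rounding step) rather than once after the fact.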