Wang Jianan, Shi Yang, Chen Zhaoyun, Wen Mei
National University of Defense Technology, Deya Road, Changsha, 410000, Hunan, China; Key Laboratory of Advanced Microprocessor Chips and Systems, Deya Road, Changsha, 410000, Hunan, China.
Neural Netw. 2025 Oct;190:107703. doi: 10.1016/j.neunet.2025.107703. Epub 2025 Jun 15.
Cascading is a multi-model combination approach that balances execution efficiency and accuracy. It is widely used in industrial and commercial deployments, particularly in cloud-based inference services. With the growing demand for low-latency services, researchers are paying closer attention to the execution efficiency of these models, especially device utilization. It is highly desirable to fully utilize GPU resources by multiplexing different inference tasks on the same GPU through device-sharing techniques such as NVIDIA's Multi-Process Service (MPS). However, we find that applying MPS to cascade neural networks, which consist of multiple related submodels, is difficult. These difficulties arise primarily from the early-exit mechanism and the execution order of the submodels. To address these obstacles, we analyze the characteristics of cascade neural networks and combine them with device-sharing optimization techniques. Our findings indicate that improving the efficiency of cascade models through device sharing requires balancing the gains from sharing a device against the computation resources potentially wasted by the early-exit mechanism. Based on this analysis, we propose ESCAN, a GPU-sharing optimization framework for online inference of cascade neural networks. The framework comprises exit-ratio-aware batch-parallel execution strategies and the corresponding resource allocation algorithms, all integrated into PyTorch. Experiments show that ESCAN improves inference efficiency by an average of 19.53% over an execution strategy that runs all cascade submodels in parallel, and it significantly speeds up the search for computation-resource allocation schemes. By optimizing the utilization of computational resources through effective GPU sharing, ESCAN delivers a low-latency, high-accuracy solution for interactive online services built on cascade neural networks.
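To make the early-exit mechanism and its exit-ratio dependence concrete, below is a minimal PyTorch sketch of a two-stage cascade. The function `cascade_infer`, the confidence threshold, and the toy models are illustrative assumptions, not ESCAN's actual API: the light submodel classifies the whole batch, confident samples exit early, and only the residue is forwarded to the heavy submodel.

```python
# A minimal sketch of a two-stage early-exit cascade in PyTorch. The function
# name, the confidence threshold, and the toy models below are illustrative
# assumptions, not ESCAN's actual API.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def cascade_infer(light_model: nn.Module, heavy_model: nn.Module,
                  x: torch.Tensor, threshold: float = 0.9):
    """Run the light submodel on the whole batch; forward only the
    low-confidence samples to the heavy submodel (early exit)."""
    conf, preds = F.softmax(light_model(x), dim=1).max(dim=1)
    exit_mask = conf >= threshold          # samples confident enough to exit
    final = preds.clone()
    hard = ~exit_mask                      # residue batch for the heavy model
    if hard.any():
        final[hard] = heavy_model(x[hard]).argmax(dim=1)
    # The exit ratio shrinks the heavy submodel's effective batch size, so a
    # GPU-sharing plan that ignores it misallocates compute between stages.
    return final, exit_mask.float().mean().item()

# Toy usage (hypothetical models):
light = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
heavy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512),
                      nn.ReLU(), nn.Linear(512, 10))
labels, exit_ratio = cascade_infer(light, heavy, torch.randn(64, 3, 32, 32))
```

In a GPU-sharing deployment, each submodel would typically run in its own process under MPS; on Volta and newer GPUs the standard knob for capping a process's SM share is the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable, set before CUDA initialization. Presumably, ESCAN's exit-ratio-aware resource allocation algorithms search over partitions of this kind, weighting each submodel's share by its expected workload under the exit ratio.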