Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards.

Author information

Massanes Francesc, Cadennes Marie, Brankov Jovan G

Affiliation

Illinois Institute of Technology, Medical Imaging Research Center, Chicago IL 60616, USA.

Publication information

J Electron Imaging. 2011 Jul;20(3). doi: 10.1117/1.3606588.

DOI:10.1117/1.3606588

PMID:22347787

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3280822/

Abstract

In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses the summed absolute difference (SAD) error criterion and a full grid search (FS) to find the optimal block displacement. In this evaluation we compared the execution time of a GPU and a CPU implementation for images of various sizes, using integer and non-integer search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 for an integer search grid and by a factor of 1000 for a non-integer search grid. The additional speedup for the non-integer search grid comes from the fact that the GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the method used to split data across the cards, and that an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized, non-full grid search CPU-based motion estimation methods, namely the implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and the Simplified Unsymmetrical multi-Hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed a modest improvement, even though its computational complexity is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA C1060 Tesla GPU cards.
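To make the integer-grid case concrete, the following is a minimal CPU-side Python sketch of block matching with the SAD criterion and an exhaustive (full) search. It is an illustration only, not the authors' CUDA implementation: the function names `sad`, `full_search`, and `motion_field`, the default block size of 4, the search radius of 2, and the choice of non-overlapping blocks are all assumptions made for the example. Frames are plain nested lists of integer intensities.

```python
def sad(ref, cur, bx, by, dx, dy, block):
    """Summed absolute difference between the block of `cur` whose top-left
    corner is (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(block):
        for x in range(block):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def full_search(ref, cur, bx, by, block=4, radius=2):
    """Test every integer displacement in [-radius, radius]^2 and return
    (dx, dy, sad) for the one with the smallest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates that would read outside the reference frame.
            if not (0 <= by + dy and by + dy + block <= height):
                continue
            if not (0 <= bx + dx and bx + dx + block <= width):
                continue
            cost = sad(ref, cur, bx, by, dx, dy, block)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def motion_field(ref, cur, block=4, radius=2):
    """Run the full search for every non-overlapping block of `cur`.
    Returns a dict mapping (bx, by) to (dx, dy, sad)."""
    height, width = len(cur), len(cur[0])
    return {
        (bx, by): full_search(ref, cur, bx, by, block, radius)
        for by in range(0, height - block + 1, block)
        for bx in range(0, width - block + 1, block)
    }
```

Each block's search in this sketch is independent of every other block's, which is the kind of workload that maps naturally onto many parallel threads. The non-integer search grid mentioned in the abstract would additionally require interpolating the reference frame at fractional positions, which this sketch does not attempt.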

