AI Grand ICT Research Center, Dong-eui University, Busan 47340, Republic of Korea.
Department of Computer Software Engineering, Dong-eui University, Busan 47340, Republic of Korea.
Sensors (Basel). 2022 Dec 9;22(24):9656. doi: 10.3390/s22249656.
In this paper, we propose an intra-picture prediction method for depth video based on block clustering through a neural network. The proposed method addresses the problem that a block containing two or more clusters degrades the intra-prediction performance for depth video. The proposed neural network consists of a spatial feature prediction network and a clustering network. The spatial feature prediction network exploits spatial features in the vertical and horizontal directions and contains a 1D CNN layer and a fully connected layer. The 1D CNN layer extracts the vertical and horizontal spatial features from the top and left blocks of reference pixels, respectively. Although 1D CNNs are designed for time-series data, they can also capture spatial features by treating the pixel order along a given direction as a timestamp. The fully connected layer predicts the spatial features of the block to be coded from the extracted features. The clustering network finds clusters from the spatial features output by the spatial feature prediction network. It consists of four CNN layers: the first three combine the vertical and horizontal spatial features, and the last outputs the probabilities that each pixel belongs to each cluster. The pixels of the block are predicted by the representative values of the clusters, where each representative value is the average of the reference pixels belonging to that cluster. To support intra prediction for various block sizes, the block is scaled to the network input size, and the prediction result is scaled back to the original size. In network training, the mean squared error between the original block and the predicted block is used as the loss function, and a penalty on output values far from both ends is added to the loss to encourage clear network clustering.
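The cluster-based prediction and the training loss described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two-cluster setting, the soft-assignment shapes, and the `p*(1-p)` form of the "far from both ends" penalty are all assumptions made for clarity.

```python
import numpy as np

def representative_values(ref_pixels, ref_prob):
    # Representative value of each cluster: the (probability-weighted)
    # average of the reference pixels belonging to that cluster.
    # ref_pixels : (R,)   reference pixels (top row + left column)
    # ref_prob   : (R, K) per-reference-pixel cluster probabilities
    w = ref_prob / (ref_prob.sum(axis=0, keepdims=True) + 1e-8)
    return w.T @ ref_pixels                       # shape (K,)

def predict_block(ref_pixels, ref_prob, block_prob):
    # Each block pixel is predicted as the probability-weighted mix of
    # the cluster representative values.
    # block_prob : (N, N, K) clustering-network output probabilities
    reps = representative_values(ref_pixels, ref_prob)
    return block_prob @ reps                      # shape (N, N)

def training_loss(orig, pred, block_prob, lam=0.1):
    # MSE between original and predicted block, plus a penalty that
    # grows when probabilities sit far from both ends (0 and 1),
    # pushing the network toward clear-cut clustering.
    mse = np.mean((orig - pred) ** 2)
    penalty = lam * np.mean(block_prob * (1.0 - block_prob))
    return mse + penalty
```

With hard (0/1) cluster probabilities the penalty vanishes, so the loss reduces to the plain MSE; soft, ambiguous assignments are penalized, which matches the stated goal of clear clustering.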
Simulation results show that, under the same distortion, the bit rate is reduced by up to 12.45% compared with the latest video coding standard.