The figure depicts the P-frame coding architecture with our proposed scheme. It comprises a flow estimation network (PWC-Net), a motion compensation network (MC-Net), and two α-map-guided codecs: the motion codec and the conditional inter-frame codec. The former encodes the optical flow map estimated between the coding frame $x_t$ and its reference frame $\hat{x}_{t-1}$; the latter, adapted from CANF, encodes the coding frame $x_t$ conditioned on the motion-compensated reference frame $x_c$. Both $x_t$ and $\hat{x}_{t-1}$ are of size $W \times H$. In this work, the α-map of dimension $W/64 \times H/64$ serves as a prior conditioning signal that trades off the bit rates consumed by the motion and the inter-frame codecs. Each component $\alpha_i \in [-1,1]$ of the α-map is a real number corresponding to a distinct $64 \times 64$ patch $i$ in the input frame. Altering the α-map thus achieves a spatially-varying trade-off between the bit rates for motion coding and inter-frame coding. Moreover, the α-map is adapted on a frame-by-frame basis, allowing frame-adaptive optimization.
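As a concrete illustration of the α-map layout, the sketch below builds a neutral α-map for an example frame and looks up the α value governing a given pixel's patch. The frame size and the `patch_alpha` helper are illustrative, not part of the proposed system:

```python
W, H = 1920, 1088  # example frame size, assumed to be multiples of 64
cols, rows = W // 64, H // 64  # the alpha-map is W/64 x H/64: one entry per patch

# Neutral alpha-map: alpha_i = 0 weights motion and inter-frame rates equally.
alpha_map = [[0.0] * cols for _ in range(rows)]

def patch_alpha(x, y):
    # Look up the alpha value of the 64x64 patch containing pixel (x, y).
    return alpha_map[y // 64][x // 64]
```

Each pixel of the frame maps to exactly one α value, and neighboring patches may carry different values, which is what makes the bit-allocation trade-off spatially varying.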
In our proposed method, to adapt the P-frame coding pipeline to the α-map, we incorporate Spatial Feature Transform (SFT) layers and SFT Residual Blocks (SFT Resblk) into the motion and the conditional inter-frame codecs. SFT applies a spatially-adaptive affine transformation to the latent features in the encoding/decoding transforms, with the element-wise affine parameters derived from the prior conditioning modules.
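A minimal sketch of the SFT idea follows. The conditioning module `sft_params` here is a toy stand-in: in the actual codecs the per-position affine parameters are produced by learned prior-conditioning networks, whereas this sketch only shows the element-wise affine form of the transform:

```python
def sft_params(alpha):
    # Hypothetical conditioning module: maps an alpha value to an affine
    # (gamma, beta) pair. In the real codec this is a learned network.
    gamma = 1.0 + 0.5 * alpha
    beta = 0.1 * alpha
    return gamma, beta

def sft_layer(features, alpha_cond):
    # Spatially-adaptive affine transform: f' = gamma * f + beta,
    # with (gamma, beta) varying per spatial position.
    out = []
    for feat_row, alpha_row in zip(features, alpha_cond):
        out_row = []
        for f, a in zip(feat_row, alpha_row):
            g, b = sft_params(a)
            out_row.append(g * f + b)
        out.append(out_row)
    return out
```

With this toy parameterization, a position conditioned on α = 0 passes through unchanged, while other α values rescale and shift the feature, mirroring how the real SFT layers modulate the latents according to the α-map.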
We adopt the above objective function to train our system end-to-end. The patch-level bit rate $R_{M_i}$ for motion coding is weighted exponentially with a factor $\delta^{\alpha_i}$ against the patch-level bit rate $R_{R_i}$ for inter-frame coding according to the α-map. The base $\delta=10$ of the exponential is chosen empirically to compensate for the uneven ratio between $R_{M_i}$ and $R_{R_i}$, and $N$ is the number of $64 \times 64$ patches in the input frame. The model is thus trained to suppress $R_{M_i}$ in exchange for a higher $R_{R_i}$ when $\alpha_i = 1$, and conversely when $\alpha_i = -1$. Setting $\alpha_i = 0$ weights $R_{M_i}$ and $R_{R_i}$ equally.
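The effect of the exponential weighting can be made concrete with a few numbers. The per-patch rate values below are illustrative, and the pairing of $\delta^{\alpha_i}$ with the motion rate follows the description above:

```python
delta = 10.0  # empirically chosen base of the exponential weighting

def patch_rate_term(alpha, r_motion, r_inter):
    # Per-patch rate term: the motion rate is scaled by delta**alpha,
    # while the inter-frame rate is left unweighted.
    return delta ** alpha * r_motion + r_inter

# alpha = 1  -> motion bits penalized 10x, so the model suppresses R_M
# alpha = -1 -> motion penalty scaled by 0.1, so motion coding gets more bits
# alpha = 0  -> motion and inter-frame rates weighted equally
```

For example, with a motion rate of 2 and an inter-frame rate of 3 (arbitrary units), the term evaluates to 23 at α = 1, 5 at α = 0, and 3.2 at α = −1, which is exactly the asymmetry the training objective exploits.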
After training, we determine the α-map for content-adaptive bit allocation between motion and inter-frame coding. To this end, we propose two algorithms that use online back-propagation. The idea is to consider the α-map associated with each input frame as coding parameters to be updated on-the-fly by back-propagation.
In a greedy algorithm, we optimize the α-map for each frame sequentially. We minimize the equation shown above with respect to the α-map, with $R_W$ taking the form $\sum_{i=1}^{N} (R_{M_i}+R_{R_i})$. The factor $\delta^{\alpha_i}$ is discarded here because we wish to arrive at an α-map that best trades off the bit rates for motion and inter-frame coding so as to minimize the rate-distortion cost of the current coding frame. In a sense, this approach is sub-optimal because it greedily optimizes the α-map of a coding frame without regard to its impact on future frames.
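The greedy per-frame update can be sketched as gradient descent on a stand-in rate-distortion cost. `rd_cost` below is a hypothetical smooth proxy, not the real codec; in practice the gradient is obtained by back-propagating through the actual rate and distortion terms:

```python
def rd_cost(alpha):
    # Hypothetical convex proxy for the RD cost of one patch,
    # with its minimum placed arbitrarily at alpha = 0.3.
    return (alpha - 0.3) ** 2

def optimize_alpha(alpha0=0.0, lr=0.1, steps=200, eps=1e-4):
    # Online optimization of a single alpha value by gradient descent.
    a = alpha0
    for _ in range(steps):
        # Central-difference gradient; autodiff plays this role in practice.
        grad = (rd_cost(a + eps) - rd_cost(a - eps)) / (2 * eps)
        a -= lr * grad
        a = max(-1.0, min(1.0, a))  # keep alpha within [-1, 1]
    return a
```

In the proposed scheme this update runs jointly over all $N$ entries of the α-map for the current frame, treating the α-map as coding parameters updated on-the-fly.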
To explore the potential of our scheme, we additionally experiment with a look-ahead mechanism that optimizes the α-map of a coding frame by taking into account its impact on future frames. In particular, the resulting α-map of the first frame in display order is used for coding that frame, whereas the α-map of the second frame serves as the initial α-map when the second frame is coded, to be further optimized together with its subsequent frame in a sliding-window manner.
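The look-ahead schedule can be sketched as follows. `optimize_window` is a hypothetical stand-in for the joint back-propagation step over a two-frame window; the sketch only shows which α-map is committed and which is carried forward:

```python
def look_ahead_schedule(frames, optimize_window):
    # optimize_window(frame_t, frame_t1, init_alpha) stands in for the joint
    # optimization over a two-frame window; it returns the alpha-maps of the
    # window's first and second frames.
    committed = []
    init_alpha = 0.0  # neutral initialization for the first window
    for t in range(len(frames) - 1):
        a_cur, a_next = optimize_window(frames[t], frames[t + 1], init_alpha)
        committed.append(a_cur)   # used to code frame t
        init_alpha = a_next       # warm-starts the next window
    committed.append(init_alpha)  # last frame has no look-ahead frame left
    return committed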
We first visualize how the α-map impacts the motion bit rate and the quality of the compressed optical flow map patch by patch, validating that our model reacts to the given α-map as designed. Next, we compare the rate-distortion performance of the proposed content-adaptive method with the state-of-the-art learned video compression method DCVC, showing the effectiveness of our method. The two variants ($Ours^1$ vs. $Ours^2$) of the proposed method refer to optimizing the α-map by considering only the current frame and by additionally looking ahead to one future frame, respectively. Finally, we visualize the optimized α-map and the bit allocation results.