Deep Video Prediction Through Sparse Motion Regularization
IEEE International Conference on Image Processing (ICIP), Oct. 2020.
|
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction scheme built on a sparse motion field composed of a few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and the estimation of their motion are addressed by two neural networks trained in a reinforcement learning setting. Our model achieves state-of-the-art performance on the CaltechPed, UCF101, and CIF datasets in one-step and multi-step prediction tests. It shows good generalization and is able to learn well from small training data.
|
SME-Net: Sparse Motion Estimation for Parametric Video Prediction through Reinforcement Learning
IEEE International Conference on Computer Vision (ICCV), Oct. 2019.
|
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction scheme built on a sparse motion field composed of a few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and the estimation of their motion are addressed by two neural networks trained in a reinforcement learning setting. Our model achieves state-of-the-art performance on the CaltechPed, UCF101, and CIF datasets in one-step and multi-step prediction tests. It shows good generalization and is able to learn well from small training data.
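As a rough illustration of how such a sparse motion field can drive a dense prediction, the sketch below interpolates per-pixel motion from a handful of critical pixels using inverse-distance weights (a stand-in for the paper's POBMC weighting, which is derived from block-overlap statistics) and backward-warps the previous frame. All shapes and the nearest-neighbor sampling are simplifying assumptions.

```python
import numpy as np

def predict_frame(prev_frame, crit_pos, crit_mv, eps=1e-6):
    """Sketch of parametric prediction from a sparse motion field.

    prev_frame: (H, W) previous frame (grayscale for simplicity)
    crit_pos:   (K, 2) critical-pixel coordinates (y, x)
    crit_mv:    (K, 2) their motion vectors (dy, dx)
    """
    H, W = prev_frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(np.float32)

    # Inverse-distance weights to each critical pixel (a stand-in for the
    # POBMC weighting used in the paper).
    d2 = ((coords[:, None, :] - crit_pos[None, :, :]) ** 2).sum(-1)
    w = 1.0 / (d2 + eps)
    w /= w.sum(axis=1, keepdims=True)

    # Dense motion field as a weighted combination of sparse motion vectors.
    mv = (w[:, :, None] * crit_mv[None, :, :]).sum(axis=1).reshape(H, W, 2)

    # Backward warping with nearest-neighbor sampling (bilinear in practice).
    src_y = np.clip(np.round(ys - mv[..., 0]), 0, H - 1).astype(int)
    src_x = np.clip(np.round(xs - mv[..., 1]), 0, W - 1).astype(int)
    return prev_frame[src_y, src_x]
```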
|
P-frame Coding Proposal by NCTU: Parametric Video Prediction through Backprop-based Motion Estimation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2020.
|
This paper presents a parametric video prediction scheme with backprop-based motion estimation, in response to the CLIC challenge on P-frame compression. Recognizing that most learning-based video codecs rely on optical flow-based temporal prediction and suffer from having to signal a large amount of motion information, we propose to perform parametric overlapped block motion compensation on a sparse motion field. In forming this sparse motion field, we run the steepest-descent algorithm on a loss function to identify critical pixels, whose motion vectors are communicated to the decoder. Moreover, we introduce a critical pixel dropout mechanism to strike a good balance between motion overhead and prediction quality. Compression results with HEVC-based residual coding on CLIC validation sequences show that our parametric video prediction achieves higher PSNR and MS-SSIM than optical flow-based warping. Moreover, the critical pixel dropout mechanism proves beneficial in terms of rate-distortion performance. Our scheme offers the potential to work with learned residual coding.
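To make the steepest-descent step concrete, here is a hedged sketch that refines a set of sparse motion vectors by plain gradient descent on the prediction error, using a differentiable backward warp. The inverse-distance interpolation again stands in for the actual POBMC weighting, and all sizes, step counts, and learning rates are illustrative.

```python
import torch
import torch.nn.functional as F

def estimate_sparse_motion(prev, curr, crit_pos, steps=200, lr=0.5):
    """prev/curr: (1, 1, H, W) frames in [0, 1]; crit_pos: (K, 2) pixel
    coordinates (y, x). Returns refined motion vectors for the critical pixels."""
    H, W = prev.shape[-2:]
    mv = torch.zeros(crit_pos.shape[0], 2, requires_grad=True)

    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    coords = torch.stack([ys, xs], -1).view(-1, 2)                 # (H*W, 2)
    w = 1.0 / (torch.cdist(coords, crit_pos.float()) ** 2 + 1e-6)  # (H*W, K)
    w = w / w.sum(1, keepdim=True)

    opt = torch.optim.SGD([mv], lr=lr)  # plain steepest descent
    for _ in range(steps):
        opt.zero_grad()
        dense = (w @ mv).view(1, H, W, 2)            # sparse -> dense field
        # Backward-warping grid normalized to [-1, 1] for grid_sample.
        gy = (ys - dense[0, ..., 0]) / (H - 1) * 2 - 1
        gx = (xs - dense[0, ..., 1]) / (W - 1) * 2 - 1
        grid = torch.stack([gx, gy], -1).unsqueeze(0)  # (1, H, W, 2), x first
        pred = F.grid_sample(prev, grid, align_corners=True)
        loss = F.mse_loss(pred, curr)
        loss.backward()
        opt.step()
    return mv.detach()
```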
|
ANFIC: Image Compression Using Augmented Normalizing Flows
IEEE Open Journal of Circuits and Systems, Dec. 2021.
|
This paper introduces an end-to-end learned image compression system, termed ANFIC, based on Augmented Normalizing Flows (ANF). ANF is a new type of flow model that stacks multiple variational autoencoders (VAEs) for greater model expressiveness. VAE-based image compression has gone mainstream, showing promising compression performance. Our work presents the first attempt to leverage VAE-based compression in a flow-based framework. ANFIC further improves compression efficiency by hierarchically stacking and extending multiple VAEs. The invertibility of ANF, together with our training strategies, enables ANFIC to support a wide range of quality levels without changing the encoding and decoding networks. Extensive experimental results show that, in terms of PSNR-RGB, ANFIC performs comparably to or better than the state-of-the-art learned image compression. Moreover, it performs close to VVC intra coding, from low-rate compression up to perceptually lossless compression. In particular, ANFIC achieves state-of-the-art performance when extended with conditional convolution for variable-rate compression with a single model. The source code of ANFIC can be found at https://github.com/dororojames/ANFIC.
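For readers unfamiliar with ANF, the sketch below shows one autoencoding transform assembled from two additive coupling steps; stacking several such transforms yields the hierarchical, invertible structure that ANFIC builds on. The tiny MLPs and tensor shapes are illustrative assumptions, not ANFIC's actual networks.

```python
import torch
import torch.nn as nn

class AutoencodingTransform(nn.Module):
    """One ANF autoencoding transform built from two additive couplings."""

    def __init__(self, dim):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.dec = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, z):
        z = z + self.enc(x)   # analysis step: push information into z
        x = x - self.dec(z)   # synthesis step: remove what z now explains
        return x, z

    def inverse(self, x, z):
        x = x + self.dec(z)   # exact inverses of the two additive couplings
        z = z - self.enc(x)
        return x, z

# The augmented input starts as (x, e) with independent noise e; after a few
# transforms, x is driven toward zero and z carries the compressible latent.
x, e = torch.randn(4, 16), torch.randn(4, 16)
step = AutoencodingTransform(16)
x1, z1 = step(x, e)
x0, e0 = step.inverse(x1, z1)
assert torch.allclose(x0, x)  # invertibility check
```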
|
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2021.
|
This paper presents a new attempt at using augmented normalizing flows (ANF) for lossy image compression. ANF is a specific type of normalizing flow model that augments the input with independent noise, allowing a smoother transformation from the augmented input space to the latent space. Inspired by the fact that ANF can offer greater expressivity by stacking multiple variational autoencoders (VAEs), we generalize the popular VAE-based compression framework with the autoencoding transforms of ANF. When evaluated on the Kodak dataset, our ANF-based model provides a 3.4% greater BD-rate saving than a VAE-based baseline that implements a hyper-prior with mean prediction. Interestingly, it benefits even more from the incorporation of a post-processing network, showing an 11.8% rate saving, compared to 6.0% for the baseline plus post-processing.
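The BD-rate numbers quoted here follow the standard Bjøntegaard calculation: fit cubic polynomials of log-rate as a function of quality, then compare their integrals over the overlapping quality range. A common minimal implementation looks like this; negative values mean the test codec saves rate relative to the anchor.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) between two rate-distortion curves."""
    la, lt = np.log(rate_anchor), np.log(rate_test)
    pa = np.polyfit(psnr_anchor, la, 3)   # log-rate as a cubic in quality
    pt = np.polyfit(psnr_test, lt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))   # overlapping quality range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_t = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100
```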
|
A Hybrid Layered Image Compressor with Deep-Learning Technique
IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 2020.
|
The proposed compression system features a VVC intra codec as the base layer and a learning-based residual codec as the enhancement layer. The latter refines the quality of the base layer by sending a latent residual signal. In particular, a base-layer-guided attention module is employed to focus the residual extraction on critical high-frequency areas. To reconstruct the image, this latent residual signal is combined with the base-layer output in a non-linear fashion by a neural-network-based synthesizer. The proposed method shows comparable rate-distortion performance to single-layer VVC intra in terms of common objective metrics, but presents better subjective quality, particularly at high compression ratios, in some cases. It consistently outperforms HEVC intra, JPEG 2000, and JPEG. The proposed system requires 18M network parameters in 16-bit floating-point format. On average, encoding an image on an Intel Xeon Gold 6154 takes about 13.5 minutes, with the VVC base layer dominating the encoding runtime. By contrast, decoding is dominated by the residual decoder and the synthesizer, requiring 31 seconds per image.
|
Learned Image Compression With Soft Bit-based Rate-distortion Optimization
IEEE International Conference on Image Processing (ICIP), Oct. 2019.
|
This paper introduces the notion of soft bits to address rate-distortion optimization for learning-based image compression. Recent methods for such compression train an autoencoder end-to-end with an objective that strikes a balance between distortion and rate. They face the zero-gradient issue due to quantization and the difficulty of estimating the rate accurately. Inspired by soft quantization, we represent quantization indices of feature maps with differentiable soft bits. This allows us to couple the rate estimation tightly with context-adaptive binary arithmetic coding. It also provides a differentiable distortion objective function. Experimental results show that our approach achieves state-of-the-art compression performance among learning-based schemes in terms of MS-SSIM and PSNR.
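A minimal sketch of the underlying soft-quantization idea, assuming a simple scalar grid of levels: each feature value is softly assigned to the quantization levels so that gradients survive the quantizer. The paper's soft bits further tie this assignment to context-adaptive binary arithmetic coding, which is not shown here.

```python
import torch

def soft_quantize(y, levels, temperature=1.0):
    """Differentiable surrogate for scalar quantization.

    y: feature tensor; levels: (L,) quantization grid.
    """
    d2 = (y.unsqueeze(-1) - levels) ** 2          # squared distance to levels
    p = torch.softmax(-d2 / temperature, dim=-1)  # soft assignment weights
    y_soft = (p * levels).sum(-1)                 # differentiable surrogate
    y_hard = levels[d2.argmin(-1)]                # hard value used at test time
    # Straight-through style: forward uses hard, backward uses soft gradients.
    return y_soft + (y_hard - y_soft).detach()

levels = torch.linspace(-2, 2, 5)
y = torch.randn(3, requires_grad=True)
q = soft_quantize(y, levels)
q.sum().backward()  # gradients flow through the soft assignment
```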
|
An Autoencoder-based Image Compressor with Principal Component Analysis and Soft-Bit Rate Estimation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2019.
|
We propose a lossy image compression system using a deep-learning autoencoder structure to participate in the Challenge on Learned Image Compression (CLIC) 2018. Our autoencoder uses residual blocks with skip connections to reduce the correlation among image pixels and condense the input image into a set of feature maps, a compact representation of the original image. Bit allocation and bitrate control are implemented using importance maps and a quantizer. The importance maps are generated by a separate neural net in the encoder. The autoencoder and the importance net are trained jointly to minimize a weighted sum of mean squared error, MS-SSIM, and a rate estimate. Our aim is to produce reconstructed images with good subjective quality subject to the 0.15 bits-per-pixel constraint.
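The importance-map mechanism can be pictured as a spatially varying channel mask, as in this hedged sketch; the exact masking rule and quantizer in the paper may differ.

```python
import torch

def apply_importance_mask(features, importance):
    """Keep more feature channels where the importance map is high.

    features:   (N, C, H, W) encoder output
    importance: (N, 1, H, W) values in [0, 1] from the importance network
    """
    N, C, H, W = features.shape
    # Number of channels to keep at each spatial location.
    keep = torch.ceil(importance * C).clamp(1, C)
    idx = torch.arange(C, device=features.device).view(1, C, 1, 1)
    mask = (idx < keep).float()   # zero out channels beyond the kept count
    return features * mask

feat = torch.randn(1, 32, 8, 8)
imp = torch.rand(1, 1, 8, 8)      # from a separate importance network
coded = apply_importance_mask(feat, imp)
```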
|
A Dual-Critic Reinforcement Learning Framework for Frame-level Bit Allocation in HEVC/H.265
Data Compression Conference (DCC), Mar. 2021.
|
This paper introduces a dual-critic reinforcement learning (RL) framework to address the problem of frame-level bit allocation in HEVC/H.265. The objective is to minimize the distortion of a group of pictures (GOP) under a rate constraint. Previous RL-based methods tackle such a constrained optimization problem by maximizing a single reward function that often combines a distortion reward and a rate reward. However, the way these rewards are combined is usually ad hoc and may not generalize well to various coding conditions and video sequences. To overcome this issue, we adapt the deep deterministic policy gradient (DDPG) reinforcement learning algorithm for use with two critics, one learning to predict the distortion reward and the other the rate reward. In particular, the distortion critic works to update the agent when the rate constraint is satisfied. By contrast, the rate critic makes the rate constraint a priority when the agent goes over the bit budget. Experimental results on commonly used datasets show that our method outperforms the bit allocation scheme in x265 and the single-critic baseline by a significant margin in terms of rate-distortion performance while offering fairly precise rate control.
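The division of labor between the two critics can be summarized in a few lines. This is a schematic sketch of the actor update only: all networks are trivial stand-ins, and critic training plus the rest of the DDPG machinery are omitted.

```python
import torch
import torch.nn as nn

# The distortion critic drives the update when the bit budget is respected;
# the rate critic takes over when it is exceeded.
actor = nn.Linear(8, 1)            # state -> bit-allocation action
critic_d = nn.Linear(8 + 1, 1)     # (state, action) -> distortion return
critic_r = nn.Linear(8 + 1, 1)     # (state, action) -> rate return
opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_update(state, over_budget):
    action = actor(state)
    critic = critic_r if over_budget else critic_d
    objective = -critic(torch.cat([state, action], dim=-1)).mean()
    opt.zero_grad()
    objective.backward()
    opt.step()

actor_update(torch.randn(16, 8), over_budget=False)
```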
|
Reinforcement Learning for HEVC/H.265 Intra-Frame Rate Control
IEEE International Symposium on Circuits and Systems (ISCAS), Italy, May 2018.
|
Reinforcement learning has proven effective for solving decision-making
problems. However, its application to modern video codecs has yet to be
seen. This paper presents an early attempt to introduce reinforcement
learning to HEVC/H.265 intra-frame rate control. The task is to determine
a quantization parameter value for every coding tree unit in a frame,
with the objective being to minimize the frame-level distortion subject
to a rate constraint. We draw an analogy between the rate control problem
and the reinforcement learning problem, by considering the texture complexity
of coding tree units and bit balance as the environment state, the
quantization parameter value as an action that an agent needs to take,
and the negative distortion of the coding tree unit as an immediate reward.
We train a neural network based on Q-learning to be our agent, which
observes the state to evaluate the reward for each possible action. When
trained on only limited sequences, the proposed model can already perform
comparably with the rate control algorithm in HM-16.15.
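The formulation above maps naturally onto a standard Q-learning update. In this hedged sketch, the two-dimensional state packs texture complexity and bit balance, the 52 actions index HEVC QP values 0 through 51, and the reward is the negative coding-tree-unit distortion; the network size is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 52))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def td_update(state, action, reward, next_state, gamma=0.99):
    """One Q-learning step on a batch of (state, action, reward, next_state)."""
    with torch.no_grad():
        target = reward + gamma * q_net(next_state).max(-1).values
    pred = q_net(state).gather(-1, action.unsqueeze(-1)).squeeze(-1)
    loss = F.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# state = (texture complexity, remaining bit balance); reward = -distortion
td_update(torch.rand(8, 2), torch.randint(0, 52, (8,)),
          -torch.rand(8), torch.rand(8, 2))
```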
|
Reinforcement Learning for HEVC/H.265 Frame-level Bit Allocation
IEEE International Conference on Digital Signal Processing (DSP), China, Nov. 2018.
|
Frame-level bit allocation is crucial to video rate control.
The problem is often cast as minimizing the distortions of a
group of video frames subject to a rate constraint.
When these video frames are related through inter-frame
prediction, the bit allocation for different frames exhibits
dependency. To address such dependency, this paper introduces
reinforcement learning. We first consider frame-level texture
complexity and bit balance as a state signal, define the bit
allocation for each frame as an action, and compute the negative
frame-level distortion as an immediate reward signal. We then train
a neural network to be our agent, which observes the state to
allocate bits to each frame in order to maximize cumulative reward.
As compared to the rate control scheme in HM-16.15, our method shows
better PSNR performance while having smaller bit rate fluctuations.
|
HEVC/H.265 Coding Unit Split Decision Using Deep Reinforcement Learning
IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Xiamen, Nov. 2017.
|
The video coding community has long been seeking more effective
rate-distortion optimization techniques than the widely adopted
greedy approach. The difficulty arises when we need to predict how
the coding mode decision made in one stage would affect subsequent
decisions and thus the overall coding performance. Taking a data-driven
approach, we introduce in this paper deep reinforcement learning (RL)
as a mechanism for the coding unit (CU) split decision in HEVC/H.265.
We propose to regard the luminance samples of a CU together with the
quantization parameter as its state, the split decision as an action,
and the reduction in rate-distortion cost relative to keeping the current
CU intact as the immediate reward. Based on the Q-learning algorithm,
we learn a convolutional neural network to approximate the rate-distortion
cost reduction of each possible state-action pair. The proposed scheme
performs comparably with the current full rate-distortion optimization
scheme in HM-16.15, incurring a 2.5% average BD-rate loss. While also
performing similarly to a conventional scheme that treats the split
decision as a binary classification problem, our scheme can additionally
quantify the rate-distortion cost reduction, enabling more applications.
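A hedged sketch of that state-action value network: a small CNN consumes the CU's luma samples together with the QP and outputs two Q-values, interpreted as the predicted rate-distortion cost reductions for keeping versus splitting the CU. Layer sizes are illustrative, not those of the paper.

```python
import torch
import torch.nn as nn

class SplitQNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.head = nn.Linear(32 * 16 + 1, 2)  # +1 for the QP input

    def forward(self, luma, qp):
        h = self.features(luma)                   # (N, 512)
        return self.head(torch.cat([h, qp], 1))   # (N, 2): [keep, split]

net = SplitQNet()
luma = torch.rand(1, 1, 64, 64)   # CU luminance samples
qp = torch.tensor([[32.0]])
split = net(luma, qp).argmax(1)   # split if its predicted cost reduction wins
```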
|
All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
|
In this paper we tackle the problem of unsupervised
domain adaptation for the task of semantic segmentation,
where we attempt to transfer the knowledge learned upon
synthetic datasets with ground-truth labels to real-world
images without any annotation. With the hypothesis that the structural content of images is the most informative and decisive factor for semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations, which further enables image translation across domains and label transfer to improve segmentation performance. Extensive experiments
verify the effectiveness of our proposed DISE model and
demonstrate its superiority over several state-of-the-art approaches.
|
GSVNet: Guided Spatially-Varying Convolution for Fast Semantic Segmentation on Video
IEEE International Conference on Multimedia and Expo (ICME), July 2021.
|
This paper addresses fast semantic segmentation on video.
Video segmentation often calls for real-time, or even faster
than real-time, processing. One common recipe for conserving the computation arising from feature extraction is to propagate the features of a few selected keyframes. However, recent advances in fast image segmentation make these solutions less attractive. To leverage fast image segmentation for video segmentation, we propose a simple yet efficient propagation framework. Specifically, we perform lightweight flow estimation in 1/8-downscaled image space for temporal warping in segmentation output space. Moreover, we introduce
a guided spatially-varying convolution for fusing segmentations
derived from the previous and current frames, to mitigate
propagation error and enable lightweight feature extraction
on non-keyframes. Experimental results on Cityscapes
and CamVid show that our scheme achieves the state-of-the-art
accuracy-throughput trade-off on video segmentation.
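The propagation step can be pictured as follows: estimate flow at 1/8 resolution, upsample it, and warp the previous frame's segmentation output. This is a simplified sketch; the flow network and the guided spatially-varying fusion are omitted, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def warp_segmentation(prev_logits, flow_small):
    """Warp the previous frame's segmentation with 1/8-scale flow."""
    N, C, H, W = prev_logits.shape
    # Upsample the flow to full resolution (scale the vectors too).
    flow = F.interpolate(flow_small, size=(H, W), mode="bilinear",
                         align_corners=False) * 8.0
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    gx = (xs + flow[:, 0]) / (W - 1) * 2 - 1   # normalize to [-1, 1]
    gy = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1)       # (N, H, W, 2)
    return F.grid_sample(prev_logits, grid, align_corners=True)

logits = torch.randn(1, 19, 512, 1024)   # e.g., Cityscapes class scores
flow8 = torch.randn(1, 2, 64, 128)       # flow in 1/8-scale image space
warped = warp_segmentation(logits, flow8)
```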
|
Weakly-Supervised Image Semantic Segmentation Using Graph Convolutional Networks
IEEE International Conference on Multimedia and Expo (ICME), July 2021.
|
This work addresses weakly-supervised image semantic segmentation based on image-level class labels. One common approach to this task is to propagate the activation scores of Class Activation Maps (CAMs) using a random-walk mechanism in order to arrive at complete pseudo labels for training a semantic segmentation network in a fully-supervised manner. However, the feed-forward nature of the random walk imposes no regularization on the quality of the resulting complete pseudo labels. To overcome this issue, we propose a Graph Convolutional Network (GCN)-based feature propagation framework. We formulate the generation of complete pseudo labels as a semi-supervised learning task and learn a 2-layer GCN separately for every training image by back-propagating a Laplacian and an entropy regularization loss. Experimental results on the PASCAL VOC 2012 dataset confirm the superiority of our scheme to several state-of-the-art baselines.
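As an illustration, a 2-layer GCN with the two regularizers might look like the following sketch. The graph construction, the seed selection from CAMs, and the paper's exact loss weighting are simplified assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hid_dim, n_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, n_classes)

    def forward(self, x, a_hat):          # a_hat: normalized adjacency (N, N)
        h = F.relu(a_hat @ self.w1(x))
        return self.w2(a_hat @ h)

def propagation_loss(logits, seed_labels, seed_mask, a_hat, lam=1.0, mu=0.1):
    p = F.softmax(logits, dim=-1)
    # Supervised term on confident seed pixels only.
    ce = F.cross_entropy(logits[seed_mask], seed_labels[seed_mask])
    # Entropy term pushes unlabeled nodes toward confident predictions.
    ent = -(p * torch.log(p + 1e-8)).sum(-1)[~seed_mask].mean()
    # Laplacian smoothness: neighboring nodes get similar scores.
    deg = torch.diag(a_hat.sum(1))
    lap = torch.trace(p.t() @ (deg - a_hat) @ p) / p.shape[0]
    return ce + mu * ent + lam * lap

x, a = torch.randn(100, 256), torch.rand(100, 100)
a_hat = a / a.sum(1, keepdim=True)   # crude normalization for the demo
seeds = torch.zeros(100, dtype=torch.bool)
seeds[:10] = True
labels = torch.randint(0, 21, (100,))
loss = propagation_loss(TwoLayerGCN(256, 64, 21)(x, a_hat), labels, seeds, a_hat)
```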
|
Semantic Segmentation on Compressed Video Using Block Motion Compensation and Guided Inpainting
IEEE International Symposium on Circuits and Systems (ISCAS), Spain, Oct 2020.
|
This paper addresses the problem of fast semantic segmentation on compressed video.
Unlike most prior works for video segmentation, which perform feature propagation based on optical flow estimates or sophisticated warping techniques,
ours takes advantage of block motion vectors in the compressed bitstream to propagate the segmentation of a keyframe to subsequent non-keyframes.
This approach, however, needs to respect the inter-frame prediction structure,
which often suggests recursive, multi-step prediction with error propagation and accumulation in the temporal dimension. To tackle the issue,
we refine the motion-compensated segmentation using inpainting. Our inpainting network incorporates guided non-local attention for long-range reference and pixel-adaptive convolution for ensuring the local coherence of the segmentation.
A fusion step then follows to combine both the motion-compensated and inpainted segmentations.
Experimental results show that our method outperforms the state-of-the-art baselines in terms of segmentation accuracy.
Moreover, it introduces the least amount of network parameters and multiply-add operations for non-keyframe segmentation.
|
Video Rescaling Networks with Joint Optimization Strategies for Downscaling and Upscaling
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
|
This paper addresses the video rescaling task, which arises from the need to adapt video spatial resolution to suit individual viewing devices. We aim to jointly optimize video downscaling and upscaling as a combined task. Most recent studies focus on image-based solutions, which do not consider temporal information. We present two joint optimization approaches based on invertible neural networks with coupling layers. Our Long Short-Term Memory Video Rescaling Network (LSTM-VRN) leverages temporal information in the low-resolution video to form an explicit prediction of the missing high-frequency information for upscaling. Our Multi-input Multi-output Video Rescaling Network (MIMO-VRN) proposes a new strategy for downscaling and upscaling a group of video frames simultaneously. Not only do these approaches outperform the image-based invertible model in quantitative and qualitative results, but they also show much-improved upscaling quality compared with video rescaling methods without joint optimization. To the best of our knowledge, this work is the first attempt at jointly optimizing video downscaling and upscaling.
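Both networks rest on invertible coupling layers. The sketch below shows the basic mechanism with additive couplings over a low-frequency branch xl and a high-frequency branch xh: downscaling runs forward and keeps xl, while upscaling runs the exact inverse with the missing high-frequency branch predicted (LSTM-VRN) or handled group-wise (MIMO-VRN). The transform networks are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, xl, xh):            # used when downscaling
        xl = xl + self.f(xh)
        xh = xh + self.g(xl)
        return xl, xh

    def inverse(self, xl, xh):            # used when upscaling
        xh = xh - self.g(xl)
        xl = xl - self.f(xh)
        return xl, xh

layer = CouplingLayer(16)
xl, xh = torch.randn(4, 16), torch.randn(4, 16)
yl, yh = layer(xl, xh)
rl, rh = layer.inverse(yl, yh)
assert torch.allclose(rl, xl) and torch.allclose(rh, xh)  # exact inverse
```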
|
Class-incremental Learning with Rectified Feature-Graph Preservation
Asian Conference on Computer Vision (ACCV), Japan, Nov. 2020.
|
In this paper, we address the problem of distillation-based class-incremental learning with a single head. A central theme
of this task is to learn new classes that arrive in sequential phases over time while keeping the model's capability of
recognizing seen classes with only limited memory for preserving seen data samples. Many regularization strategies have
been proposed to mitigate the phenomenon of catastrophic forgetting. To better understand the essence of these regularizations, we introduce a feature-graph preservation perspective. Insights into their merits and faults motivate our
weighted-Euclidean regularization for old knowledge preservation. We further propose rectified cosine normalization and show
how it can work with binary cross-entropy to increase class separation for effective learning of new classes. Experimental
results on both CIFAR-100 and ImageNet datasets demonstrate that our method outperforms the state-of-the-art approaches in
reducing classification error, easing catastrophic forgetting, and encouraging evenly balanced accuracy over different classes.
|
Learning Goal-oriented Visual Dialogue: Imitating and Surpassing Analytic Experts
IEEE International Conference on Multimedia and Expo (ICME), July 2019.
|
This paper tackles the problem of learning a questioner in
the goal-oriented visual dialog task. Several previous works
adopt model-free reinforcement learning. Most pretrain the
model from a finite set of human-generated data. We argue
that using limited demonstrations to kick-start the questioner
is insufficient due to the large policy search space. Inspired
by a recently proposed information-theoretic approach, we develop two analytic experts to serve as a source of high-quality demonstrations for imitation learning. We then take
advantage of reinforcement learning to refine the model towards the goal-oriented objective. Experimental results on the
GuessWhat?! dataset show that our method has the combined
merits of imitation and reinforcement learning, achieving the
state-of-the-art performance.
|
Learning Priors for Adversarial Autoencoders
Asia-Pacific Signal and Information Processing Association (APSIPA), USA, Nov. 2018.
|
Most deep latent factor models choose simple priors for simplicity, tractability or not knowing what prior to use. Recent
studies show that the choice of the prior may have a profound effect on the expressiveness of the model, especially when
its generative network has limited capacity. In this paper, we propose to learn a proper prior from data for adversarial
autoencoders (AAEs). We introduce the notion of code generators to transform manually selected simple priors into ones
that can better characterize the data distribution. Experimental results show that the proposed model can generate better
image quality and learn better disentangled representations than AAEs in both supervised and unsupervised settings.
Lastly, we demonstrate its ability to perform cross-domain translation in a text-to-image synthesis task.
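The code-generator idea reduces to a small reparameterization of the prior, roughly as sketched below; network sizes are assumptions, and the adversarial training losses are omitted.

```python
import torch
import torch.nn as nn

# A small network transforms samples from a manually selected simple prior
# into a learned prior; the AAE's latent discriminator then matches the
# encoder's posterior to this transformed prior instead of the raw Gaussian.
code_generator = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8))

def sample_prior(n):
    simple = torch.randn(n, 64)      # simple prior draw
    return code_generator(simple)    # learned prior code

# In the adversarial step, sample_prior(n) supplies the "real" samples that
# the latent discriminator sees.
prior_codes = sample_prior(32)
```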
|
IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2021.
|
This paper demonstrates a model-based reinforcement learning framework for training a self-flying drone. We implement the Dreamer proposed in prior work as an environment model that responds to the action taken by the drone by predicting the next video frame as a new state signal. The Dreamer is a conditional video sequence generator. This model-based environment avoids the time-consuming interactions between the agent and the real environment, greatly speeding up the training process. This demonstration showcases for the first time the application of the Dreamer to train an agent that can finish the racing task in the AirSim simulator.
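A hedged sketch of the imagined rollout that makes this fast: the next state comes from the learned model instead of a simulator step. Both networks below are trivial stand-ins for the Dreamer-style world model and the drone's policy.

```python
import torch
import torch.nn as nn

env_model = nn.Linear(32 + 4, 32)   # (state, action) -> predicted next state
policy = nn.Linear(32, 4)           # state -> action

def imagined_rollout(state, horizon=15):
    """Roll out entirely inside the learned model, no simulator calls."""
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        state = torch.tanh(env_model(torch.cat([state, action], dim=-1)))
        trajectory.append((state, action))
    return trajectory

traj = imagined_rollout(torch.zeros(1, 32))
```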
|