Research Content

Deep Video Prediction
Deep Video Prediction Through Sparse Motion Regularization
Yung-Han Ho, Chih Chun Chan, Wen-Hsiao Peng
IEEE International Conference on Image Processing (ICIP), Oct. 2020.
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction on a sparse motion field composed of a few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and their motion estimation are addressed by two neural networks trained under a reinforcement learning setting. Our model achieves state-of-the-art performance on the CaltechPed, UCF101, and CIF datasets in one-step and multi-step prediction tests. It shows good generalization results and is able to learn well on small training data.
SME-Net: Sparse Motion Estimation for Parametric Video Prediction through Reinforcement Learning
Yung-Han Ho, Chuan-Yuan Cho, Wen-Hsiao Peng, Guo-Lun Jin
IEEE International Conference on Computer Vision (ICCV), Oct. 2019.
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction on a sparse motion field composed of a few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and their motion estimation are addressed by two neural networks trained under a reinforcement learning setting. Our model achieves state-of-the-art performance on the CaltechPed, UCF101, and CIF datasets in one-step and multi-step prediction tests. It shows good generalization results and is able to learn well on small training data.
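As a rough illustration of the parametric prediction step shared by the two papers above, the PyTorch sketch below interpolates a dense motion field from a handful of critical pixels and their motion vectors and then backward-warps the previous frame. The Gaussian distance weighting and the tensor shapes are illustrative assumptions, not the exact POBMC kernel or the RL-trained networks used in the papers.

```python
import torch
import torch.nn.functional as F

def dense_field_from_sparse(coords, mvs, h, w, sigma=8.0):
    """Interpolate a dense motion field from sparse (pixel, motion-vector) pairs.

    coords: (K, 2) critical-pixel positions (y, x); mvs: (K, 2) motion vectors (dy, dx).
    The Gaussian distance weighting stands in for the POBMC kernel (assumption).
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([ys, xs], dim=-1).float()                 # (H, W, 2)
    d2 = ((grid[None] - coords[:, None, None]) ** 2).sum(-1)     # (K, H, W)
    w_k = torch.softmax(-d2 / (2 * sigma ** 2), dim=0)           # per-pixel weights
    return (w_k[..., None] * mvs[:, None, None]).sum(0)          # (H, W, 2)

def warp(prev_frame, flow):
    """Backward-warp the previous frame (1, C, H, W) with the interpolated motion field."""
    _, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([xs, ys], dim=-1).float() + flow.flip(-1)  # sample positions (x, y)
    pos[..., 0] = 2 * pos[..., 0] / (w - 1) - 1                  # normalize to [-1, 1]
    pos[..., 1] = 2 * pos[..., 1] / (h - 1) - 1
    return F.grid_sample(prev_frame, pos[None], align_corners=True)

# Toy usage: 3 critical pixels predict the next 64x64 frame.
prev = torch.rand(1, 3, 64, 64)
coords = torch.tensor([[10., 12.], [32., 40.], [50., 20.]])
mvs = torch.tensor([[1., 0.], [0., -2.], [2., 1.]])
pred = warp(prev, dense_field_from_sparse(coords, mvs, 64, 64))
```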

Learning-based Video Compression
P-frame Coding Proposal by NCTU: Parametric Video Prediction through Backprop-based Motion Estimation
Yung-Han Ho, Chih-Chun Chan, David Alexandre, Wen-Hsiao Peng, Chih-Peng Chang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2020.
This paper presents a parametric video prediction scheme with backprop-based motion estimation, in response to the CLIC challenge on P-frame compression. Recognizing that most learning-based video codecs rely on optical flow-based temporal prediction and suffer from having to signal a large amount of motion information, we propose to perform parametric overlapped block motion compensation on a sparse motion field. In forming this sparse motion field, we conduct the steepest descent algorithm on a loss function for identifying critical pixels, of which the motion vectors are communicated to the decoder. Moreover, we introduce a critical pixel dropout mechanism to strike a good balance between motion overhead and prediction quality. Compression results with HEVC-based residual coding on CLIC validation sequences show that our parametric video prediction achieves higher PSNR and MS-SSIM than optical flow-based warping. Moreover, our critical pixel dropout mechanism is found beneficial in terms of rate-distortion performance. Our scheme offers the potential for working with learned residual coding.
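The sketch below illustrates, under simplifying assumptions, the backprop-based idea of ranking candidate critical pixels: take one steepest-descent step of a photometric loss with respect to a dense motion field, use the per-pixel gradient magnitude as saliency, and randomly drop some of the selected pixels. The L1 loss, the zero-initialized flow, and the dropout rule are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Backward-warp `frame` (1, C, H, W) with a dense flow field (1, 2, H, W)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float()[None]          # (1, H, W, 2) in (x, y)
    pos = base + flow.permute(0, 2, 3, 1)                       # add (dx, dy) displacements
    pos = 2 * pos / torch.tensor([w - 1, h - 1]) - 1            # normalize to [-1, 1]
    return F.grid_sample(frame, pos, align_corners=True)

def select_critical_pixels(prev_frame, cur_frame, k=16, dropout_p=0.25):
    """Rank pixels by how strongly their motion affects the prediction error.

    One gradient step of a photometric loss w.r.t. a dense motion field serves
    as the saliency signal; the loss and the random dropout rule are
    simplifying assumptions, not the paper's exact procedure.
    """
    flow = torch.zeros(1, 2, *prev_frame.shape[-2:], requires_grad=True)
    loss = F.l1_loss(backward_warp(prev_frame, flow), cur_frame)
    grad, = torch.autograd.grad(loss, flow)
    saliency = grad.norm(dim=1).flatten()            # per-pixel gradient magnitude
    top = saliency.topk(k).indices                   # candidate critical pixels
    keep = torch.rand(k) > dropout_p                 # critical-pixel dropout
    return top[keep]                                 # flat indices of pixels to signal

critical = select_critical_pixels(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```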

Learning-based Image Compression
ANFIC: Image Compression Using Augmented Normalizing Flows
Yung-Han Ho, Chih-Chun Chan, Wen-Hsiao Peng, Hsueh-Ming Hang, Marek Domanski
IEEE Open Journal of Circuits and Systems, Dec. 2021.
This paper introduces an end-to-end learned image compression system, termed ANFIC, based on Augmented Normalizing Flows (ANF). ANF is a new type of flow model, which stacks multiple variational autoencoders (VAE) for greater model expressiveness. VAE-based image compression has gone mainstream, showing promising compression performance. Our work presents the first attempt to leverage VAE-based compression in a flow-based framework. ANFIC further advances compression efficiency by hierarchically stacking and extending multiple VAEs. The invertibility of ANF, together with our training strategies, enables ANFIC to support a wide range of quality levels without changing the encoding and decoding networks. Extensive experimental results show that, in terms of PSNR-RGB, ANFIC performs comparably to or better than the state-of-the-art learned image compression. Moreover, it performs close to VVC intra coding, from low-rate compression up to perceptually lossless compression. In particular, ANFIC achieves state-of-the-art performance when extended with conditional convolution for variable rate compression with a single model. The source code of ANFIC can be found at https://github.com/dororojames/ANFIC.
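A minimal sketch of the kind of additive autoencoding transform that ANF stacks is given below: an encoder updates the augmented latent, a decoder updates the image branch, and the pair is invertible by construction. The convolutional parameterization and the matching spatial resolutions are simplifying assumptions; ANFIC's actual architecture is hierarchical and includes quantization and entropy coding.

```python
import torch
import torch.nn as nn

class AutoencodingTransform(nn.Module):
    """One additive coupling pair in the spirit of ANF: invertible by construction.

    g_enc and g_dec play the roles of a VAE's encoder and decoder; stacking
    several such transforms gives the hierarchical structure the abstract
    describes. The 3x3-conv parameterization is an illustrative assumption.
    """
    def __init__(self, x_ch=3, z_ch=32):
        super().__init__()
        self.g_enc = nn.Sequential(nn.Conv2d(x_ch, z_ch, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(z_ch, z_ch, 3, padding=1))
        self.g_dec = nn.Sequential(nn.Conv2d(z_ch, z_ch, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(z_ch, x_ch, 3, padding=1))

    def forward(self, x, z):          # encoding direction
        z = z + self.g_enc(x)
        x = x - self.g_dec(z)
        return x, z

    def inverse(self, x, z):          # decoding direction: exact inverse of forward
        x = x + self.g_dec(z)
        z = z - self.g_enc(x)
        return x, z

# Augment the input with noise, run one transform, and verify invertibility.
x, z = torch.rand(1, 3, 64, 64), torch.randn(1, 32, 64, 64)
step = AutoencodingTransform()
x1, z1 = step(x, z)
x0, z0 = step.inverse(x1, z1)
print(torch.allclose(x0, x), torch.allclose(z0, z))   # True True
```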
End-to-End Learned Image Compression with Augmented Normalizing Flows
Yung-Han Ho, Chih-Chun Chan, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2021.
This paper presents a new attempt at using augmented normalizing flows (ANF) for lossy image compression. ANF is a specific type of normalizing flow models that augment the input with an independent noise, allowing a smoother transformation from the augmented input space to the latent space. Inspired by the fact that ANF can offer greater expressivity by stacking multiple variational autoencoders (VAE), we generalize the popular VAE-based compression framework by the autoencoding transforms of ANF. When evaluated on Kodak dataset, our ANF-based model provides 3.4% higher BD-rate saving as compared with a VAE-based baseline that implements hyper-prior with mean prediction. Interestingly, it benefits even more from the incorporation of a post-processing network, showing 11.8% rate saving as compared to 6.0% with the baseline plus post-processing.
A Hybrid Layered Image Compressor with Deep-Learning Technique
Wei-Cheng Lee, Chih-Peng Chang, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 2020.
The proposed compression system features a VVC intra codec as the base layer and a learning-based residual codec as the enhancement layer. The latter aims to refine the quality of the base layer by sending a latent residual signal. In particular, a base-layer-guided attention module is employed to focus the residual extraction on critical high-frequency areas. To reconstruct the image, this latent residual signal is combined with the base-layer output in a non-linear fashion by a neural-network-based synthesizer. The proposed method shows rate-distortion performance comparable to single-layer VVC intra in terms of common objective metrics, but presents better subjective quality in some cases, particularly at high compression ratios. It consistently outperforms HEVC intra, JPEG 2000, and JPEG. The proposed system incurs 18M network parameters in 16-bit floating-point format. On average, encoding an image on an Intel Xeon Gold 6154 takes about 13.5 minutes, with the VVC base layer dominating the encoding runtime. In contrast, the decoding is dominated by the residual decoder and the synthesizer, requiring 31 seconds per image.
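The sketch below shows one plausible reading of the base-layer-guided attention idea: an attention map derived from the base-layer reconstruction gates the residual before it is encoded into the latent residual signal. The layer sizes and the sigmoid gating are assumptions, not the paper's exact enhancement-layer design.

```python
import torch
import torch.nn as nn

class GuidedResidualEncoder(nn.Module):
    """Enhancement-layer encoder gated by a base-layer-guided attention map.

    The attention focuses residual coding on areas the base layer reconstructs
    poorly (typically high-frequency regions); the layer sizes and the sigmoid
    gating are illustrative assumptions.
    """
    def __init__(self, ch=64):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 5, stride=2, padding=2), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 5, stride=2, padding=2))

    def forward(self, original, base_recon):
        residual = original - base_recon                # signal the base layer missed
        gate = self.attn(base_recon)                    # where to spend residual bits
        return self.enc(residual * gate)                # latent residual to be coded

latent = GuidedResidualEncoder()(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```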
Learned Image Compression With Soft Bit-based Rate-distortion Optimization
David Alexandre, Chih-Peng Chang, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE International Conference on Image Processing (ICIP), Oct. 2019.
This paper introduces the notion of soft bits to address rate-distortion optimization for learning-based image compression. Recent methods for such compression train an autoencoder end-to-end with an objective to strike a balance between distortion and rate. They are faced with the zero-gradient issue due to quantization and the difficulty of estimating the rate accurately. Inspired by soft quantization, we represent quantization indices of feature maps with differentiable soft bits. This allows us to couple tightly the rate estimation with context-adaptive binary arithmetic coding. It also provides a differentiable distortion objective function. Experimental results show that our approach achieves state-of-the-art compression performance among the learning-based schemes in terms of MS-SSIM and PSNR.
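As a hedged illustration of the soft-bit idea, the snippet below softly assigns each feature value to integer quantization levels and averages the binary codes of those levels under the soft assignment, yielding differentiable stand-ins for the index bits. The softmax relaxation, the temperature, and the code table are assumptions rather than the paper's exact formulation.

```python
import torch

def soft_bits(y, num_levels=8, temperature=1.0):
    """Differentiable soft-bit representation of quantization indices.

    Each feature value is softly assigned to integer levels (soft quantization),
    and the bits of each level's binary code are averaged under that soft
    assignment, giving a differentiable stand-in for the hard index bits.
    """
    levels = torch.arange(num_levels, dtype=y.dtype)                  # centers 0..L-1
    logits = -((y.unsqueeze(-1) - levels) ** 2) / temperature         # (..., L)
    probs = torch.softmax(logits, dim=-1)                             # soft assignment
    n_bits = int(num_levels).bit_length() - 1                         # bits per index
    codes = torch.tensor([[(l >> b) & 1 for b in range(n_bits)]
                          for l in range(num_levels)], dtype=y.dtype) # (L, n_bits)
    return probs @ codes                                              # expected bit values

y = torch.rand(2, 16, 8, 8) * 7           # toy feature map scaled to the level range
bits = soft_bits(y)                        # (2, 16, 8, 8, 3), each entry in [0, 1]
```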
An Autoencoder-based Image Compressor with Principle Component Analysis and Soft-Bit Rate Estimation
Chih-Peng Chang, David Alexandre, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2019.
We propose a lossy image compression system using the deep-learning autoencoder structure to participate in the Challenge on Learned Image Compression (CLIC) 2018. Our autoencoder uses residual blocks with skip connections to reduce the correlation among image pixels and condense the input image into a set of feature maps, a compact representation of the original image. The bit allocation and bitrate control are implemented by using importance maps and a quantizer. The importance maps are generated by a separate neural net in the encoder. The autoencoder and the importance net are trained jointly based on minimizing a weighted sum of mean squared error, MS-SSIM, and a rate estimate. Our aim is to produce reconstructed images with good subjective quality subject to the 0.15 bits-per-pixel constraint.
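A small sketch of importance-map-based bit allocation follows: the importance map decides, per spatial position, how many channels of the quantized latent are kept and coded. The channel-truncation rule below is a common formulation assumed here for illustration.

```python
import torch

def apply_importance_mask(features, importance):
    """Keep only the first ceil(m * C) channels of the latent at each position.

    `features`: quantized latent (N, C, H, W); `importance`: map in [0, 1]
    (N, 1, H, W) produced by a separate importance net. The channel-truncation
    rule is an assumption made for this example.
    """
    n, c, h, w = features.shape
    keep = torch.ceil(importance * c)                       # channels to keep per position
    channel_idx = torch.arange(c).view(1, c, 1, 1)
    mask = (channel_idx < keep).to(features.dtype)          # (N, C, H, W) binary mask
    return features * mask                                  # masked latent to be coded

latent = torch.randn(1, 32, 16, 16).round()                 # toy quantized latent
imp = torch.rand(1, 1, 16, 16)                              # toy importance map
masked = apply_importance_mask(latent, imp)
```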

Reinforcement Learning for Video Encoder Control
A Dual-Critic Reinforcement Learning Framework for Frame-level Bit Allocation in HEVC/H.265
Yung-Han Ho, Guo-Lun Jin, Yun Liang, Wen-Hsiao Peng, Xiao-Bo Li
Data Compression Conference (DCC), Mar. 2021.
This paper introduces a dual-critic reinforcement learning (RL) framework to address the problem of frame-level bit allocation in HEVC/H.265. The objective is to minimize the distortion of a group of pictures (GOP) under a rate constraint. Previous RL-based methods tackle such a constrained optimization problem by maximizing a single reward function that often combines a distortion and a rate reward. However, the way these rewards are combined is usually ad hoc and may not generalize well to various coding conditions and video sequences. To overcome this issue, we adapt the deep deterministic policy gradient (DDPG) reinforcement learning algorithm for use with two critics, with one learning to predict the distortion reward and the other the rate reward. In particular, the distortion critic works to update the agent when the rate constraint is satisfied. By contrast, the rate critic makes the rate constraint a priority when the agent goes over the bit budget. Experimental results on commonly used datasets show that our method outperforms the bit allocation scheme in x265 and the single-critic baseline by a significant margin in terms of rate-distortion performance while offering fairly precise rate control.
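The core dual-critic rule can be sketched as follows: the actor is updated with the distortion critic while the GOP is within its bit budget and with the rate critic once the budget is exceeded. The interfaces (each critic maps a state-action pair to a scalar) are assumptions; the full DDPG training loop, replay buffer, and critic targets are omitted.

```python
import torch

def actor_loss(actor, distortion_critic, rate_critic, state, bits_used, bit_budget):
    """Dual-critic policy update rule (simplified reading of the abstract).

    The distortion critic drives the actor while the GOP stays within its bit
    budget; once the budget is exceeded, the rate critic takes over so that
    meeting the rate constraint becomes the priority.
    """
    action = actor(state)
    if bits_used <= bit_budget:
        q = distortion_critic(state, action)   # maximize the (negative) distortion reward
    else:
        q = rate_critic(state, action)         # maximize the rate reward to re-enter the budget
    return -q.mean()                           # DDPG-style: ascend the chosen critic
```

In a complete DDPG loop, this loss would be backpropagated through the actor only, while each critic is trained against its own reward signal.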
Reinforcement Learning for HEVC/H.265 Intra-Frame Rate Control
Jun-Hao Hu, Wen-Hsiao Peng, Chia-Hua Chung
IEEE International Symposium on Circuits and Systems (ISCAS), Italy, May 2018.
Reinforcement learning has proven effective for solving decision making problems. However, its application to modern video codecs has yet to be seen. This paper presents an early attempt to introduce reinforcement learning to HEVC/H.265 intra-frame rate control. The task is to determine a quantization parameter value for every coding tree unit in a frame, with the objective being to minimize the frame-level distortion subject to a rate constraint. We draw an analogy between the rate control problem and the reinforcement learning problem, by considering the texture complexity of coding tree units and bit balance as the environment state, the quantization parameter value as an action that an agent needs to take, and the negative distortion of the coding tree unit as an immediate reward. We train a neural network based on Q-learning to be our agent, which observes the state to evaluate the reward for each possible action. When trained on only limited sequences, the proposed model can already perform comparably with the rate control algorithm in HM-16.15.
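A minimal Q-learning agent in the spirit of this abstract is sketched below: the state is a small feature vector (CTU texture complexity and bit balance), the action is a candidate QP value, and the network predicts a Q-value per QP. The two-dimensional state, layer sizes, and epsilon-greedy exploration are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QpAgent(nn.Module):
    """Q-network for CTU-level QP selection (illustrative sketch).

    State: a small feature vector, e.g. CTU texture complexity and the remaining
    bit balance; action: one of `num_qp` candidate QP values. The layer sizes
    and the two-dimensional state are assumptions made for this example.
    """
    def __init__(self, state_dim=2, num_qp=52):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                               nn.Linear(64, num_qp))

    def act(self, state, epsilon=0.1):
        if torch.rand(1).item() < epsilon:                     # explore
            return torch.randint(0, self.q[-1].out_features, (1,)).item()
        return self.q(state).argmax().item()                   # exploit: best QP index

agent = QpAgent()
qp_index = agent.act(torch.tensor([0.7, 0.4]))                 # (complexity, bit balance)
```

During training, the immediate reward would be the negative distortion of the coding tree unit, as the abstract describes.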
Reinforcement Learning for HEVC/H.265 Frame-level Bit Allocation
Lian-Ching Chen, Jun-Hao Hu, Wen-Hsiao Peng
IEEE International Conference on Digital Signal Processing (DSP), China, Nov. 2018.
Frame-level bit allocation is crucial to video rate control. The problem is often cast as minimizing the distortions of a group of video frames subject to a rate constraint. When these video frames are related through inter-frame prediction, the bit allocation for different frames exhibits dependency. To address such dependency, this paper introduces reinforcement learning. We first consider frame-level texture complexity and bit balance as a state signal, define the bit allocation for each frame as an action, and compute the negative frame-level distortion as an immediate reward signal. We then train a neural network to be our agent, which observes the state to allocate bits to each frame in order to maximize the cumulative reward. As compared to the rate control scheme in HM-16.15, our method shows better PSNR performance while having smaller bit rate fluctuations.
HEVC/H.265 Coding Unit Split Decision Using Deep Reinforcement Learning
Chia-Hua Chung, Wen-Hsiao Peng, Jun-Hao Hu
IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Xiamen, Nov. 2017.
The video coding community has long been seeking more effective rate-distortion optimization techniques than the widely adopted greedy approach. The difficulty arises when we need to predict how the coding mode decision made in one stage would affect subsequent decisions and thus the overall coding performance. Taking a data-driven approach, we introduce in this paper deep reinforcement learning (RL) as a mechanism for the coding unit (CU) split decision in HEVC/H.265. We propose to regard the luminance samples of a CU together with the quantization parameter as its state, the split decision as an action, and the reduction in rate-distortion cost relative to keeping the current CU intact as the immediate reward. Based on the Q-learning algorithm, we learn a convolutional neural network to approximate the rate-distortion cost reduction of each possible state-action pair. The proposed scheme performs comparably with the current full rate-distortion optimization scheme in HM-16.15, incurring a 2.5% average BD-rate loss. While also performing similarly to a conventional scheme that treats the split decision as a binary classification problem, our scheme can additionally quantify the rate-distortion cost reduction, enabling more applications.
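The sketch below shows a CNN Q-function of the kind described: the CU's luminance samples and the QP (broadcast as a second input plane) map to Q-values for the two actions, keep or split, interpreted as the expected rate-distortion cost reduction. The architecture and QP normalization are assumptions.

```python
import torch
import torch.nn as nn

class SplitQNet(nn.Module):
    """CNN Q-function for the CU split decision (illustrative sketch).

    Input: the CU's luminance samples with the QP broadcast as a second plane;
    output: Q-values for the two actions {keep CU intact, split CU}. The layer
    sizes and QP normalization are assumptions made for this example.
    """
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 2))

    def forward(self, luma, qp):
        qp_plane = torch.full_like(luma, float(qp) / 51.0)   # broadcast normalized QP
        return self.net(torch.cat([luma, qp_plane], dim=1))  # (N, 2) Q-values

q_values = SplitQNet()(torch.rand(1, 1, 64, 64), qp=32)
split = q_values.argmax(dim=1).item() == 1                   # take the action with larger Q
```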

Domain Adaptation for Semantic Segmentation
All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation
Wei-Lun Chang, Hui-Po Wang, Wen-Hsiao Peng, Wei-Chen Chiu
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
In this paper, we tackle the problem of unsupervised domain adaptation for the task of semantic segmentation, where we attempt to transfer the knowledge learned from synthetic datasets with ground-truth labels to real-world images without any annotation. With the hypothesis that the structural content of images is the most informative and decisive factor for semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations, which can further realize image translation across domains and enable label transfer to improve segmentation performance. Extensive experiments verify the effectiveness of our proposed DISE model and demonstrate its superiority over several state-of-the-art approaches.
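A toy version of the disentanglement idea is sketched below: a structure encoder keeps a spatial, domain-invariant code, a texture encoder produces a global, domain-specific code, and a shared decoder recombines them, so cross-domain translation amounts to swapping texture codes. All layer choices are illustrative assumptions, and the adversarial and segmentation losses are omitted.

```python
import torch
import torch.nn as nn

def conv(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU())

def deconv(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.ReLU())

class DISESketch(nn.Module):
    """Structure/texture disentanglement in the spirit of DISE (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.structure_enc = nn.Sequential(conv(3, 32), conv(32, 64))
        self.texture_enc = nn.Sequential(conv(3, 32), conv(32, 64), nn.AdaptiveAvgPool2d(1))
        self.decoder = nn.Sequential(deconv(128, 32), nn.ConvTranspose2d(32, 3, 4, 2, 1))

    def forward(self, content_img, style_img):
        s = self.structure_enc(content_img)                             # spatial structure code
        t = self.texture_enc(style_img).expand(-1, -1, *s.shape[-2:])   # global texture code
        return self.decoder(torch.cat([s, t], dim=1))                   # translated image

# Translate a synthetic image into the style of a real one by borrowing its texture code.
translated = DISESketch()(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```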

Image / Video Semantic Segmentation
GSVNet: Guided Spatially-Varying Convolution for Fast Semantic Segmentation on Video
Shih-Po Lee, Si-Cun Chen, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2021.
This paper addresses fast semantic segmentation on video. Video segmentation often calls for real-time, or even faster than real-time, processing. One common recipe for conserving the computation arising from feature extraction is to propagate the features of a few selected keyframes. However, recent advances in fast image segmentation make these solutions less attractive. To leverage fast image segmentation to further video segmentation, we propose a simple yet efficient propagation framework. Specifically, we perform lightweight flow estimation in 1/8-downscaled image space for temporal warping in segmentation output space. Moreover, we introduce a guided spatially-varying convolution for fusing segmentations derived from the previous and current frames, to mitigate propagation error and enable lightweight feature extraction on non-keyframes. Experimental results on Cityscapes and CamVid show that our scheme achieves the state-of-the-art accuracy-throughput trade-off on video segmentation.
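The propagation step can be sketched as follows: flow estimated at 1/8 resolution is upsampled, rescaled, and used to backward-warp the previous frame's segmentation logits. Any lightweight flow network could supply the low-resolution flow; the guided spatially-varying convolution fusion is not shown, and the helper below is only an illustrative sketch.

```python
import torch
import torch.nn.functional as F

def propagate_segmentation(prev_seg_logits, flow_1_8):
    """Warp the previous frame's segmentation with flow estimated at 1/8 scale.

    `prev_seg_logits`: (N, num_classes, H, W); `flow_1_8`: (N, 2, H/8, W/8) in
    pixels at the downscaled resolution. Only the upsampling and warping step
    of the propagation framework is covered here.
    """
    n, _, h, w = prev_seg_logits.shape
    flow = F.interpolate(flow_1_8, size=(h, w), mode="bilinear", align_corners=False) * 8
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([xs, ys], dim=-1).float()[None] + flow.permute(0, 2, 3, 1)
    pos = 2 * pos / torch.tensor([w - 1, h - 1]) - 1           # normalize to [-1, 1]
    return F.grid_sample(prev_seg_logits, pos, align_corners=True)

# Toy usage with Cityscapes-like shapes: 19 classes, 512x1024 frames.
warped = propagate_segmentation(torch.rand(1, 19, 512, 1024), torch.zeros(1, 2, 64, 128))
```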
Weakly-Supervised Image Semantic Segmentation Using Graph Convolutional Networks
Shun-Yi Pan*, Cheng-You Lu*, Shih-Po Lee, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2021.
This work addresses weakly-supervised image semantic segmentation based on image-level class labels. One common approach to this task is to propagate the activation scores of Class Activation Maps (CAMs) using a random-walk mechanism in order to arrive at complete pseudo labels for training a semantic segmentation network in a fully-supervised manner. However, the feed-forward nature of the random walk imposes no regularization on the quality of the resulting complete pseudo labels. To overcome this issue, we propose a Graph Convolutional Network (GCN)-based feature propagation framework. We formulate the generation of complete pseudo labels as a semi-supervised learning task and learn a 2-layer GCN separately for every training image by back-propagating a Laplacian and an entropy regularization loss. Experimental results on the PASCAL VOC 2012 dataset confirm the superiority of our scheme to several state-of-the-art baselines.
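A minimal per-image setup in the spirit of the method is sketched below: a 2-layer GCN propagates node features over a graph, and a Laplacian smoothness loss plus an entropy loss regularize the propagated labels. How the nodes, features, and adjacency are built from CAMs is left abstract here and is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """2-layer GCN over per-image nodes (illustrative sketch)."""
    def __init__(self, in_dim, num_classes, hidden=128):
        super().__init__()
        self.w1, self.w2 = nn.Linear(in_dim, hidden), nn.Linear(hidden, num_classes)

    def forward(self, a_norm, x):                  # a_norm: normalized adjacency (N, N)
        h = torch.relu(a_norm @ self.w1(x))
        return a_norm @ self.w2(h)                 # per-node class logits

def regularization_losses(logits, adj):
    """Laplacian smoothness plus entropy regularization on the propagated labels."""
    p = torch.softmax(logits, dim=-1)
    laplacian = torch.diag(adj.sum(-1)) - adj
    lap_loss = torch.trace(p.t() @ laplacian @ p) / p.shape[0]     # smooth over the graph
    ent_loss = -(p * torch.log(p + 1e-8)).sum(-1).mean()           # encourage confident labels
    return lap_loss, ent_loss

# Toy usage: 100 nodes, 64-d features, 21 PASCAL VOC classes (with background).
x, adj = torch.rand(100, 64), (torch.rand(100, 100) > 0.9).float()
adj = ((adj + adj.t()) > 0).float()                                # symmetrize
d_inv_sqrt = torch.diag(adj.sum(-1).clamp(min=1).pow(-0.5))
logits = TwoLayerGCN(64, 21)(d_inv_sqrt @ adj @ d_inv_sqrt, x)
lap_loss, ent_loss = regularization_losses(logits, adj)
```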
Semantic Segmentation on Compressed Video Using Block Motion Compensation and Guided Inpainting
Stefanie Tanujaya, Tieh Chu, Jia-Hao Liu, Wen-Hsiao Peng
IEEE International Symposium on Circuits and Systems (ISCAS), Spain, Oct 2020.
This paper addresses the problem of fast semantic segmentation on compressed video. Unlike most prior works for video segmentation, which perform feature propagation based on optical flow estimates or sophisticated warping techniques, ours takes advantage of block motion vectors in the compressed bitstream to propagate the segmentation of a keyframe to subsequent non-keyframes. This approach, however, needs to respect the inter-frame prediction structure, which often suggests recursive, multi-step prediction with error propagation and accumulation in the temporal dimension. To tackle the issue, we refine the motion-compensated segmentation using inpainting. Our inpainting network incorporates guided non-local attention for long-range reference and pixel-adaptive convolution for ensuring the local coherence of the segmentation. A fusion step then follows to combine both the motion-compensated and inpainted segmentations. Experimental results show that our method outperforms the state-of-the-art baselines in terms of segmentation accuracy. Moreover, it introduces the least amount of network parameters and multiply-add operations for non-keyframe segmentation.

Human Pose Estimation Using Radar
Human Pose Estimation Using Millimeter Wave Radar

Video Super Resolution and Rescaling
Video Rescaling Networks with Joint Optimization Strategies for Downscaling and Upscaling
Yan-Cheng Huang*, Yi-Hsin Chen*, Cheng-You Lu, Hui-Po Wang, Wen-Hsiao Peng and Ching-Chun Huang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
This paper addresses the video rescaling task, which arises from the need to adapt video spatial resolution to suit individual viewing devices. We aim to jointly optimize video downscaling and upscaling as a combined task. Most recent studies focus on image-based solutions, which do not consider temporal information. We present two joint optimization approaches based on invertible neural networks with coupling layers. Our Long Short-Term Memory Video Rescaling Network (LSTM-VRN) leverages temporal information in the low-resolution video to form an explicit prediction of the missing high-frequency information for upscaling. Our Multi-input Multi-output Video Rescaling Network (MIMO-VRN) proposes a new strategy for downscaling and upscaling a group of video frames simultaneously. Not only do they outperform the image-based invertible model in terms of quantitative and qualitative results, but they also show much better upscaling quality than video rescaling methods without joint optimization. To the best of our knowledge, this work is the first attempt at the joint optimization of video downscaling and upscaling.
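The invertible-coupling idea can be sketched as below: PixelUnshuffle splits a frame into a coarse branch and high-frequency subbands, and an additive coupling mixes them so that downscaling and upscaling form an exactly invertible pair. The temporal (LSTM or multi-frame) prediction of the discarded high-frequency branch is the papers' key ingredient and is not modeled here; everything in the sketch is an illustrative assumption.

```python
import torch
import torch.nn as nn

class RescalingCoupling(nn.Module):
    """Invertible coupling between a coarse branch and a detail branch (sketch)."""
    def __init__(self):
        super().__init__()
        self.down, self.up = nn.PixelUnshuffle(2), nn.PixelShuffle(2)
        self.f = nn.Conv2d(3, 9, 3, padding=1)     # coarse branch -> detail update
        self.g = nn.Conv2d(9, 3, 3, padding=1)     # detail branch -> coarse update

    def encode(self, frame):                       # downscaling direction
        sub = self.down(frame)                     # (N, 12, H/2, W/2) for an RGB frame
        low, high = sub[:, :3], sub[:, 3:]
        high = high + self.f(low)
        low = low + self.g(high)
        return low, high                           # transmit `low`; `high` must be predicted

    def decode(self, low, high):                   # upscaling direction: exact inverse
        low = low - self.g(high)
        high = high - self.f(low)
        return self.up(torch.cat([low, high], dim=1))

model = RescalingCoupling()
frame = torch.rand(1, 3, 64, 64)
low, high = model.encode(frame)
print(torch.allclose(model.decode(low, high), frame))   # True: the pair is invertible
```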

Incremental Learning
Class-incremental Learning with Rectified Feature-Graph Preservation
Cheng-Hsun Lei*, Yi-Hsin Chen*, Wen-Hsiao Peng, Wei-Chen Chiu
Asian Conference on Computer Vision (ACCV), Japan, Nov. 2020.
In this paper, we address the problem of distillation-based class-incremental learning with a single head. A central theme of this task is to learn new classes that arrive in sequential phases over time while keeping the model's capability of recognizing seen classes with only limited memory for preserving seen data samples. Many regularization strategies have been proposed to mitigate the phenomenon of catastrophic forgetting. To better understand the essence of these regularizations, we introduce a feature-graph preservation perspective. Insights into their merits and faults motivate our weighted-Euclidean regularization for old knowledge preservation. We further propose rectified cosine normalization and show how it can work with binary cross-entropy to increase class separation for effective learning of new classes. Experimental results on both the CIFAR-100 and ImageNet datasets demonstrate that our method outperforms the state-of-the-art approaches in reducing classification error, easing catastrophic forgetting, and encouraging evenly balanced accuracy over different classes.
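The classification head can be sketched as a cosine-normalized layer trained with per-class binary cross-entropy, which keeps old- and new-class scores on a comparable scale across incremental phases. The ReLU standing in for the rectification step and the scale factor are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Cosine-normalized classification head trained with BCE (illustrative sketch)."""
    def __init__(self, feat_dim=512, num_classes=100, scale=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, features):
        f = F.normalize(F.relu(features), dim=-1)            # rectified, unit-norm features
        w = F.normalize(self.weight, dim=-1)                  # unit-norm class prototypes
        return self.scale * f @ w.t()                         # cosine-similarity logits

head = CosineClassifier()
logits = head(torch.rand(8, 512))
targets = F.one_hot(torch.randint(0, 100, (8,)), 100).float()
loss = F.binary_cross_entropy_with_logits(logits, targets)   # per-class BCE, not softmax CE
```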

Visual Question Answering
Learning Goal-oriented Visual Dialogue: Imitating and Surpassing Analytic Experts
Yen-Wei Chang, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2019.
This paper tackles the problem of learning a questioner in the goal-oriented visual dialog task. Several previous works adopt model-free reinforcement learning. Most pretrain the model from a finite set of human-generated data. We argue that using limited demonstrations to kick-start the questioner is insufficient due to the large policy search space. Inspired by a recently proposed information-theoretic approach, we develop two analytic experts to serve as a source of high-quality demonstrations for imitation learning. We then take advantage of reinforcement learning to refine the model towards the goal-oriented objective. Experimental results on the GuessWhat?! dataset show that our method has the combined merits of imitation and reinforcement learning, achieving state-of-the-art performance.

Deep Generative Model
Learning Priors for Adversarial Autoencoders
Hui-Po Wang, Wen-Hsiao Peng, Wei-Jan Ko
Asia-Pacific Signal and Information Processing Association (APSIPA), USA, Nov. 2018.
Most deep latent factor models choose simple priors for simplicity or tractability, or because it is unclear what prior to use. Recent studies show that the choice of the prior may have a profound effect on the expressiveness of the model, especially when its generative network has limited capacity. In this paper, we propose to learn a proper prior from data for adversarial autoencoders (AAEs). We introduce the notion of code generators to transform manually selected simple priors into ones that can better characterize the data distribution. Experimental results show that the proposed model generates images of better quality and learns better disentangled representations than AAEs in both supervised and unsupervised settings. Lastly, we present its ability to do cross-domain translation in a text-to-image synthesis task.
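A minimal sketch of the code-generator idea follows: a small network maps samples from a simple Gaussian prior into learned prior codes, which the AAE discriminator would then match against the encoder's latent codes. The MLP sizes and dimensions are illustrative assumptions; the discriminator and reconstruction losses are omitted.

```python
import torch
import torch.nn as nn

class CodeGenerator(nn.Module):
    """Maps a simple prior (standard Gaussian) into a learned prior code.

    In an adversarial autoencoder, the discriminator tries to tell these
    generated prior codes apart from the encoder's latent codes, so the learned
    prior, rather than N(0, I), shapes the latent space. Sizes are assumptions.
    """
    def __init__(self, noise_dim=64, code_dim=8):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(),
                                 nn.Linear(128, code_dim))

    def sample(self, n):
        return self.net(torch.randn(n, self.noise_dim))   # learned prior codes

prior_codes = CodeGenerator().sample(32)   # replaces z ~ N(0, I) in the AAE prior matching
```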

AI Drone
Learning to Fly with a Video Generator
Chia-Chun Chung, Wen-Hsiao Peng, Teng-Hu Cheng and Chia-Hau Yu
IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2021.
This paper demonstrates a model-based reinforcement learning framework for training a self-flying drone. We implement the Dreamer proposed in a prior work as an environment model that responds to the action taken by the drone by predicting the next video frame as a new state signal. The Dreamer is a conditional video sequence generator. This model-based environment avoids the time-consuming interactions between the agent and the environment, greatly speeding up the training process. This demonstration showcases for the first time the application of the Dreamer to train an agent that can finish the racing task in the AirSim simulator.