Research Content

Learning-based Video Compression
B-CANF: Adaptive B-frame Coding with Conditional Augmented Normalizing Flows
Mu-Jung Chen, Yi-Hsin Chen, and Wen-Hsiao Peng
Transactions on Circuits and Systems for Video Technology (TCSVT), 2023.
Over the past few years, learning-based video compression has become an active research area. However, most works focus on P-frame coding, while learned B-frame coding remains under-explored and more challenging. This work introduces a novel B-frame coding framework, termed B-CANF, that exploits conditional augmented normalizing flows for B-frame coding. B-CANF additionally features two novel elements: frame-type adaptive coding and B*-frames. Our frame-type adaptive coding learns better bit allocation for hierarchical B-frame coding by dynamically adapting the feature distributions according to the B-frame type. Our B*-frames allow greater flexibility in specifying the group-of-pictures (GOP) structure by reusing the B-frame codec to mimic P-frame coding, without the need for an additional, separate P-frame codec. On commonly used datasets, B-CANF achieves state-of-the-art compression performance compared with other learned B-frame codecs.
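To make the idea of frame-type adaptive coding concrete, below is a minimal, hypothetical PyTorch sketch of a modulation layer that adapts feature distributions to the B-frame type. The class name, the embedding-based affine form, and the assumed three frame types (reference B, non-reference B, B*) are illustrative assumptions, not the B-CANF implementation.

```python
import torch
import torch.nn as nn

class FrameTypeModulation(nn.Module):
    """Affinely modulates codec features with a learned embedding of the B-frame type."""
    def __init__(self, channels, num_frame_types=3):
        super().__init__()
        self.embed = nn.Embedding(num_frame_types, 2 * channels)  # per-type scale and shift

    def forward(self, feat, frame_type):
        # feat: (B, C, H, W); frame_type: (B,) integer ids for the assumed frame types
        scale, shift = self.embed(frame_type).chunk(2, dim=-1)
        return feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

# Example: modulate features for a batch holding one type-0 and one type-1 B-frame.
mod = FrameTypeModulation(channels=64)
out = mod(torch.randn(2, 64, 16, 16), torch.tensor([0, 1]))
```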
Learned Hierarchical B-frame Coding with Adaptive Feature Modulation for YUV 4:2:0 Content
Mu-Jung Chen, Hong-Sheng Xie, Cheng Chien, Wen-Hsiao Peng, and Hsueh-Ming Hang
International Symposium on Circuits and Systems (ISCAS), May 2023.
This paper introduces a learned hierarchical B-frame coding scheme in response to the Grand Challenge on Neural Network-based Video Coding at ISCAS 2023. We specifically address three issues: (1) B-frame coding, (2) YUV 4:2:0 coding, and (3) content-adaptive variable-rate coding with only one single model. Most learned video codecs operate internally in the RGB domain for P-frame coding. B-frame coding for YUV 4:2:0 content is largely under-explored. In addition, while there have been prior works on variable-rate coding with conditional convolution, most of them fail to consider the content information. We build our scheme on conditional augmented normalizing flows (CANF). It features conditional motion and inter-frame codecs for efficient B-frame coding. To cope with YUV 4:2:0 content, two conditional inter-frame codecs are used to process the Y and UV components separately, with the coding of the UV components conditioned additionally on the Y component. Moreover, we introduce adaptive feature modulation in every convolutional layer, taking into account both the content information and the coding levels of B-frames to achieve content-adaptive variable-rate coding. Experimental results show that our model outperforms x265 and the winner of last year's challenge on commonly used datasets in terms of PSNR-YUV.
Content-adaptive Motion Rate Adaption for Learned Video Compression
Chih-Hsuan Lin, Yi-Hsin Chen, and Wen-Hsiao Peng
Picture Coding Symposium (PCS), December 2022.
This paper introduces an online motion rate adaptation scheme for learned video compression, with the aim of achieving content-adaptive coding on individual test sequences to mitigate the domain gap between training and test data. It features a patch-level bit allocation map, termed the α-map, to trade off between the bit rates for motion and inter-frame coding in a spatially-adaptive manner. We optimize the α-map through an online back-propagation scheme at inference time. Moreover, we incorporate a look-ahead mechanism to consider its impact on future frames. Extensive experimental results confirm that the proposed scheme, when integrated into a conditional learned video codec, is able to adapt the motion bit rate effectively, showing much improved rate-distortion performance particularly on test sequences with complicated motion characteristics.
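The online α-map optimization can be pictured with the small, hypothetical sketch below, in which `codec` is a placeholder callable assumed to return differentiable per-patch distortion and rate terms. The loss form, step count, and the omission of the paper's look-ahead mechanism are all simplifying assumptions.

```python
import torch

def adapt_alpha_map(codec, frames, patches_hw=(4, 4), steps=10, lr=1e-2, lmbda=0.01):
    """Optimize a patch-level trade-off map by online back-propagation at inference time."""
    logits = torch.zeros(1, 1, *patches_hw, requires_grad=True)   # alpha-map parameters
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        alpha = torch.sigmoid(logits)                             # keep the map in (0, 1)
        distortion, motion_rate, inter_rate = codec(frames, alpha)
        rate = alpha * motion_rate + (1 - alpha) * inter_rate     # per-patch bit trade-off
        loss = distortion + lmbda * rate.mean()
        loss.backward()                                           # gradients flow only to the alpha-map
        opt.step()
    return torch.sigmoid(logits).detach()
```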
CANF-VC: Conditional Augmented Normalizing Flows for Video Compression
Yung-Han Ho, Chih-Peng Chang, Peng-Yu Chen, A. Gnutti, and Wen-Hsiao Peng
European Conference on Computer Vision (ECCV), Oct. 2022.
This paper presents an end-to-end learning-based video compression system, termed CANF-VC, based on conditional augmented normalizing flows (ANF). Most learned video compression systems adopt the same hybrid-based coding architecture as traditional codecs. Recent research on conditional coding has shown the sub-optimality of hybrid-based coding and opens up opportunities for deep generative models to take a key role in creating new coding frameworks. CANF-VC represents a new attempt that leverages conditional ANF to learn a video generative model for conditional inter-frame coding. We choose ANF because it is a special type of generative model that includes the variational autoencoder as a special case and is able to achieve better expressiveness. CANF-VC also extends the idea of conditional coding to motion coding, forming a purely conditional coding framework. Extensive experimental results on commonly used datasets confirm the superiority of CANF-VC to the state-of-the-art methods.
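As a toy illustration of conditional inter-frame coding (not the CANF-VC architecture), the sketch below shows one additive, ANF-style autoencoding step in which both the encoding and decoding transforms take the motion-compensated predictor as a condition. All module names and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalANFStep(nn.Module):
    """One additive, condition-aware autoencoding step (toy illustration)."""
    def __init__(self, channels=3, latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(2 * channels, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, latent, 3, padding=1))
        self.dec = nn.Sequential(nn.Conv2d(latent + channels, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, x, x_c, z):
        z = z + self.enc(torch.cat([x, x_c], dim=1))   # encode the frame given its predictor
        x = x - self.dec(torch.cat([z, x_c], dim=1))   # drive the image branch toward a residual-free state
        return x, z

step = ConditionalANFStep()
x = torch.rand(1, 3, 64, 64)          # current frame
x_c = torch.rand(1, 3, 64, 64)        # motion-compensated prediction used as the condition
x_out, z = step(x, x_c, torch.zeros(1, 8, 64, 64))
```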
Learned Video Compression for YUV 4:2:0 Content Using Flow-Based Conditional Inter-Frame Coding
Yung-Han Ho, Chih-Hsuan Lin, Peng-Yu Chen, Mu-Jung Chen, Chih-Peng Chang, Wen-Hsiao Peng
IEEE International Symposium on Circuits and Systems (ISCAS), May 2022.
This paper proposes a learning-based video compression framework that applies a conditional flow-based model for inter-frame coding and takes YUV 4:2:0 as the input format. Most learning-based video compression models use predictive coding and directly encode the residual signal, which is considered a sub-optimal solution. In addition, those models usually only operate on RGB, which is also regarded as an inefficient format. Furthermore, they require multiple models to fit different bit rates. To solve these issues, we introduce a conditional flow-based video compression framework to improve the coding efficiency. To adapt to the YUV 4:2:0 format, we incorporate lossless space-to-depth and depth-to-space transformations in our design. Lastly, we apply a rate-adaptation network to both the I-frame and P-frame coders to achieve variable-rate coding, which can further be extended to rate control applications. Our experimental results show comparable or better performance against x265 on the UVG and MCL-JCV common test datasets in terms of PSNR-YUV.
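The lossless space-to-depth and depth-to-space handling of YUV 4:2:0 can be illustrated with PyTorch's pixel_unshuffle/pixel_shuffle, which rearrange the full-resolution luma to the chroma resolution and back without loss. The exact tensor layout used in the paper is an assumption here.

```python
import torch
import torch.nn.functional as F

y = torch.rand(1, 1, 64, 64)     # full-resolution luma
uv = torch.rand(1, 2, 32, 32)    # half-resolution chroma (4:2:0)

y_s2d = F.pixel_unshuffle(y, downscale_factor=2)     # (1, 4, 32, 32): luma now matches chroma size
x = torch.cat([y_s2d, uv], dim=1)                    # (1, 6, 32, 32): joint codec input

y_rec = F.pixel_shuffle(x[:, :4], upscale_factor=2)  # depth-to-space inverts the transform losslessly
assert torch.equal(y_rec, y)
```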
P-frame Coding Proposal by NCTU: Parametric Video Prediction through Backprop-based Motion Estimation
Yung-Han Ho, Chih-Chun Chan, David Alexandre, Wen-Hsiao Peng, Chih-Peng Chang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2020.
This paper presents a parametric video prediction scheme with backprop-based motion estimation, in response to the CLIC challenge on P-frame compression. Recognizing that most learning-based video codecs rely on optical flow-based temporal prediction and suffer from having to signal a large amount of motion information, we propose to perform parametric overlapped block motion compensation on a sparse motion field. In forming this sparse motion field, we conduct the steepest descent algorithm on a loss function for identifying critical pixels, of which the motion vectors are communicated to the decoder. Moreover, we introduce a critical pixel dropout mechanism to strike a good balance between motion overhead and prediction quality. Compression results with HEVC-based residual coding on CLIC validation sequences show that our parametric video prediction achieves higher PSNR and MS-SSIM than optical flow-based warping. Moreover, our critical pixel dropout mechanism is found beneficial in terms of rate-distortion performance. Our scheme offers the potential for working with learned residual coding.

Learning-based Image Compression
Transformer-based Variable-rate Image Compression With Region-of-interest Control
Chia-Hao Kao, Ying-Chieh Weng, Yi-Hsin Chen, Wei-Chen Chiu, Wen-Hsiao Peng
IEEE International Conference on Image Processing (ICIP), Oct. 2023.
This paper proposes a transformer-based learned image compression system. It is capable of achieving variable-rate compression with a single model while supporting the region-of-interest (ROI) functionality. Inspired by prompt tuning, we introduce prompt generation networks to condition the transformer-based compression autoencoder. Our prompt generation networks generate content-adaptive tokens according to the input image, an ROI mask, and a rate parameter. The separation of the ROI mask and the rate parameter allows an intuitive way to achieve variable-rate and ROI coding simultaneously. Extensive experiments validate the effectiveness of our proposed method and confirm its superiority over the other competing methods.
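A hypothetical sketch of such a prompt generation network is shown below: it maps the image, the ROI mask, and a spatially broadcast rate parameter to a small set of tokens to be prepended to the transformer's input sequence. Module names, token count, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Maps (image, ROI mask, rate) to a few prompt tokens (illustrative sketch)."""
    def __init__(self, embed_dim=192, num_prompts=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3 + 1 + 1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.to_prompts = nn.Linear(64, num_prompts * embed_dim)
        self.num_prompts, self.embed_dim = num_prompts, embed_dim

    def forward(self, image, roi_mask, rate):
        rate_map = rate.view(-1, 1, 1, 1).expand_as(roi_mask)    # broadcast the rate parameter spatially
        f = self.backbone(torch.cat([image, roi_mask, rate_map], dim=1)).flatten(1)
        return self.to_prompts(f).view(-1, self.num_prompts, self.embed_dim)

gen = PromptGenerator()
prompts = gen(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256), torch.tensor([0.5]))
# prompts: (1, 4, 192) tokens to prepend to the transformer encoder's token sequence
```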
ANFIC: Image Compression Using Augmented Normalizing Flows
Yung-Han Ho, Chih-Chun Chan, Wen-Hsiao Peng, Hsueh-Ming Hang, Marek Domanski
IEEE Open Journal of Circuits and Systems, Dec. 2021.
This paper introduces an end-to-end learned image compression system, termed ANFIC, based on Augmented Normalizing Flows (ANF). ANF is a new type of flow model, which stacks multiple variational autoencoders (VAEs) for greater model expressiveness. VAE-based image compression has gone mainstream, showing promising compression performance. Our work presents the first attempt to leverage VAE-based compression in a flow-based framework. ANFIC further advances compression efficiency by hierarchically stacking and extending multiple VAEs. The invertibility of ANF, together with our training strategies, enables ANFIC to support a wide range of quality levels without changing the encoding and decoding networks. Extensive experimental results show that, in terms of PSNR-RGB, ANFIC performs comparably to or better than the state-of-the-art learned image compression methods. Moreover, it performs close to VVC intra coding, from low-rate compression up to perceptually lossless compression. In particular, ANFIC achieves state-of-the-art performance when extended with conditional convolution for variable-rate compression with a single model. The source code of ANFIC can be found at https://github.com/dororojames/ANFIC.
End-to-End Learned Image Compression with Augmented Normalizing Flows
Yung-Han Ho, Chih-Chun Chan, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2021.
This paper presents a new attempt at using augmented normalizing flows (ANF) for lossy image compression. ANF is a specific type of normalizing flow model that augments the input with independent noise, allowing a smoother transformation from the augmented input space to the latent space. Inspired by the fact that ANF can offer greater expressivity by stacking multiple variational autoencoders (VAEs), we generalize the popular VAE-based compression framework with the autoencoding transforms of ANF. When evaluated on the Kodak dataset, our ANF-based model provides 3.4% higher BD-rate saving compared with a VAE-based baseline that implements a hyper-prior with mean prediction. Interestingly, it benefits even more from the incorporation of a post-processing network, showing 11.8% rate saving as compared to 6.0% with the baseline plus post-processing.
A Hybrid Layered Image Compressor with Deep-Learning Technique
Wei-Cheng Lee, Chih-Peng Chang, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 2020.
The proposed compression system features a VVC intra codec as the base layer and a learning-based residual codec as the enhancement layer. The latter aims to refine the quality of the base layer via sending a latent residual signal. In particular, a base-layer-guided attention module is employed to focus the residual extraction on critical high-frequency areas. To reconstruct the image, this latent residual signal is combined with the base-layer output in a non-linear fashion by a neural-network-based synthesizer. The proposed method shows comparable rate-distortion performance to single-layer VVC intra in terms of common objective metrics, but presents better subjective quality particularly at high compression ratios in some cases. It consistently outperforms HEVC intra, JPEG 2000, and JPEG. The proposed system incurs 18M network parameters in 16-bit floating-point format. On average, the encoding of an image on Intel Xeon Gold 6154 takes about 13.5 minutes, with the VVC base layer dominating the encoding runtime. On the contrary, the decoding is dominated by the residual decoder and the synthesizer, requiring 31 seconds per image.
Learned Image Compression With Soft Bit-based Rate-distortion Optimization
David Alexandre, Chih-Peng Chang, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE International Conference on Image Processing (ICIP), Oct. 2019.
This paper introduces the notion of soft bits to address the rate-distortion optimization for learning-based image compression. Recent methods for such compression train an autoencoder end-to-end with an objective to strike a balance between distortion and rate. They are faced with the zero gradient issue due to quantization and the difficulty of estimating the rate accurately. Inspired by soft quantization, we represent quantization indices of feature maps with differentiable soft bits. This allows us to couple tightly the rate estimation with context-adaptive binary arithmetic coding. It also provides a differentiable distortion objective function. Experimental results show that our approach achieves the state-of-the-art compression performance among the learning-based schemes in terms of MS-SSIM and PSNR.
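The soft-bit idea can be sketched as follows: relax the hard quantization index into a softmax assignment over candidate levels and take the expected bit-plane values, which remain differentiable for rate estimation. The temperature, level set, and function names below are assumptions; the paper's exact relaxation and its coupling with binary arithmetic coding are not reproduced.

```python
import torch

def soft_bits(x, levels=torch.arange(0, 8).float(), temperature=1.0):
    """Differentiable 'soft bits': expected bit-planes under a soft level assignment."""
    dist = (x.unsqueeze(-1) - levels) ** 2               # distance of each feature to each level
    prob = torch.softmax(-dist / temperature, dim=-1)    # soft assignment over quantization levels
    num_bits = (len(levels) - 1).bit_length()            # bits needed to index the levels
    bitplanes = torch.stack(
        [((levels.long() >> b) & 1).float() for b in range(num_bits)], dim=-1
    )                                                    # (L, num_bits) hard bit patterns
    return prob @ bitplanes                              # (..., num_bits) expected (soft) bits

x = torch.randn(4) * 2 + 3.5
print(soft_bits(x))
```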
An Autoencoder-based Image Compressor with Principle Component Analysis and Soft-Bit Rate Estimation
Chih-Peng Chang, David Alexandre, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2019.
We propose a lossy image compression system using the deep-learning autoencoder structure to participate in the Challenge on Learned Image Compression (CLIC) 2018. Our autoencoder uses the residual blocks with skip connections to reduce the correlation among image pixels and condense the input image into a set of feature maps, a compact representation of the original image. The bit allocation and bitrate control are implemented by using the importance maps and quantizer. The importance maps are generated by a separate neural net in the encoder. The autoencoder and the importance net are trained jointly based on minimizing a weighted sum of mean squared error, MS-SSIM, and a rate estimate. Our aim is to produce reconstructed images with good subjective quality subject to the 0.15 bits-per-pixel constraint.

Learned Image and Video Coding for Machines
TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception
Yi-Hsin Chen, Ying-Chieh Weng, Chia-Hao Kao, Cheng Chien, Wei-Chen Chiu, and Wen-Hsiao Peng
International Conference on Computer Vision (ICCV), Oct. 2023.
This work aims for transferring a Transformer-based image compression codec from human perception to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, TransTIC adopts an instance-specific prompt generator to inject instance-specific prompts to the encoder and task-specific prompts to the decoder. Extensive experiments show that our proposed method is capable of transferring the base codec to various machine tasks and outperforms the competing methods significantly. To our best knowledge, this work is the first attempt to utilize prompting on the low-level image compression task.

Reinforcement Learning for Video Encoder Control
Neural Frank-Wolfe Policy Optimization for Region-of-Interest Intra-Frame Coding with HEVC/H.265
Yung-Han Ho, Chia-Hao Kao, Wen-Hsiao Peng, Ping-Chun Hsieh
IEEE Visual Communications and Image Processing (VCIP), December 2022.
This paper presents a reinforcement learning (RL) framework that utilizes Frank-Wolfe policy optimization to solve Coding-Tree-Unit (CTU) bit allocation for Region-of-Interest (ROI) intra-frame coding. Most previous RL-based methods employ the single-critic design, where the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyper-parameter. Recently, the dual-critic design is proposed to update the actor by alternating the rate and distortion critics. However, its convergence is not guaranteed. To address these issues, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO) in formulating the CTU-level bit allocation as an action-constrained RL problem. In this new framework, we exploit a rate critic to predict a feasible set of actions. With this feasible set, a distortion critic is invoked to update the actor to maximize the ROI-weighted image quality subject to a rate constraint. Experimental results produced with x265 confirm the superiority of the proposed method to the other baselines.
A Dual-Critic Reinforcement Learning Framework for Frame-level Bit Allocation in HEVC/H.265
Yung-Han Ho, Guo-Lun Jin, Yun Liang, Wen-Hsiao Peng, Xiao-Bo Li
Data Compression Conference (DCC), Mar. 2021.
This paper introduces a dual-critic reinforcement learning (RL) framework to address the problem of frame-level bit allocation in HEVC/H.265. The objective is to minimize the distortion of a group of pictures (GOP) under a rate constraint. Previous RL-based methods tackle such a constrained optimization problem by maximizing a single reward function that often combines a distortion and a rate reward. However, the way these rewards are combined is usually ad hoc and may not generalize well to various coding conditions and video sequences. To overcome this issue, we adapt the deep deterministic policy gradient (DDPG) reinforcement learning algorithm for use with two critics, with one learning to predict the distortion reward and the other the rate reward. In particular, the distortion critic works to update the agent when the rate constraint is satisfied. By contrast, the rate critic makes the rate constraint a priority when the agent goes over the bit budget. Experimental results on commonly used datasets show that our method outperforms the bit allocation scheme in x265 and the single-critic baseline by a significant margin in terms of rate-distortion performance while offering fairly precise rate control.
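A minimal sketch of the dual-critic update rule is given below with placeholder actor/critic callables: it only illustrates which critic drives the actor update, not the full DDPG training loop.

```python
import torch

def dual_critic_actor_loss(actor, distortion_critic, rate_critic, state, bits_used, bit_budget):
    """Pick which critic drives the actor update (illustration only)."""
    action = actor(state)
    if bits_used <= bit_budget:
        # Rate constraint satisfied: the distortion critic updates the actor.
        return -distortion_critic(state, action).mean()
    # Over budget: the rate critic takes priority to pull the agent back under the constraint.
    return -rate_critic(state, action).mean()

# Toy usage with placeholder callables standing in for the learned networks.
actor = lambda s: torch.tanh(s)
q_d = lambda s, a: -(s - a).pow(2).sum(dim=-1)   # stand-in distortion critic
q_r = lambda s, a: -a.abs().sum(dim=-1)          # stand-in rate critic
loss = dual_critic_actor_loss(actor, q_d, q_r, torch.randn(8, 4), bits_used=1.2, bit_budget=1.0)
```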
Reinforcement Learning for HEVC/H.265 Intra-Frame Rate Control
Jun-Hao Hu, Wen-Hsiao Peng, Chia-Hua Chung
IEEE International Symposium on Circuits and Systems (ISCAS), Italy, May 2018.
Reinforcement learning has proven effective for solving decision making problems. However, its application to modern video codecs has yet to be seen. This paper presents an early attempt to introduce reinforcement learning to HEVC/H.265 intra-frame rate control. The task is to determine a quantization parameter value for every coding tree unit in a frame, with the objective being to minimize the frame-level distortion subject to a rate constraint. We draw an analogy between the rate control problem and the reinforcement learning problem, by considering the texture complexity of coding tree units and bit balance as the environment state, the quantization parameter value as an action that an agent needs to take, and the negative distortion of the coding tree unit as an immediate reward. We train a neural network based on Q-learning to be our agent, which observes the state to evaluate the reward for each possible action. When trained on only limited sequences, the proposed model can already perform comparably with the rate control algorithm in HM-16.15.
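An illustrative sketch of such a Q-learning agent is given below; the two-dimensional state encoding, network sizes, and action set of 52 QP values are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class QPAgent(nn.Module):
    """Q-network scoring each candidate QP for the current coding tree unit."""
    def __init__(self, state_dim=2, num_qp_actions=52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_qp_actions),   # one Q-value per candidate QP
        )

    def forward(self, state):
        return self.net(state)

agent = QPAgent()
state = torch.tensor([[0.7, 0.3]])       # [texture complexity, normalized bit balance]
qp = agent(state).argmax(dim=-1)         # greedy action = the chosen QP
```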
Reinforcement Learning for HEVC/H.265 Frame-level Bit Allocation
Lian-Ching Chen, Jun-Hao Hu, Wen-Hsiao Peng
IEEE International Conference on Digital Signal Processing (DSP), China, Nov. 2018.
Frame-level bit allocation is crucial to video rate control. The problem is often cast as minimizing the distortions of a group of video frames subject to a rate constraint. When these video frames are related through inter-frame prediction, the bit allocation for different frames exhibits dependency. To address such dependency, this paper introduces reinforcement learning. We first consider frame-level texture complexity and bit balance as a state signal, define the bit allocation for each frame as an action, and compute the negative frame-level distortion as an immediate reward signal. We then train a neural network to be our agent, which observes the state to allocate bits to each frame in order to maximize cumulative reward. As compared to the rate control scheme in HM-16.15, our method shows better PSNR performance while having smaller bit rate fluctuations.
HEVC/H.265 Coding Unit Split Decision Using Deep Reinforcement Learning
Chia-Hua Chung, Wen-Hsiao Peng, Jun-Hao Hu
IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Xiamen, Nov. 2017.
The video coding community has long been seeking more effective rate-distortion optimization techniques than the widely adopted greedy approach. The difficulty arises when we need to predict how the coding mode decision made in one stage would affect subsequent decisions and thus the overall coding performance. Taking a data-driven approach, we introduce in this paper deep reinforcement learning (RL) as a mechanism for the coding unit (CU) split decision in HEVC/H.265. We propose to regard the luminance samples of a CU together with the quantization parameter as its state, the split decision as an action, and the reduction in rate-distortion cost relative to keeping the current CU intact as the immediate reward. Based on the Q-learning algorithm, we learn a convolutional neural network to approximate the rate-distortion cost reduction of each possible state-action pair. The proposed scheme performs comparably with the current full rate-distortion optimization scheme in HM-16.15, incurring a 2.5% average BD-rate loss. While also performing similarly to a conventional scheme that treats the split decision as a binary classification problem, our scheme can additionally quantify the rate-distortion cost reduction, enabling more applications.
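A hypothetical sketch of the split-decision network follows: a small CNN over the CU's luma samples, with the QP concatenated before the head, predicts the RD-cost reduction (Q-value) of keeping versus splitting the CU. Layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class CUSplitQNet(nn.Module):
    """Predicts the RD-cost reduction (Q-value) of keeping vs. splitting a 64x64 CU."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(nn.Linear(32 + 1, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, luma_cu, qp):
        f = self.features(luma_cu).flatten(1)        # (B, 32) pooled luma features
        return self.head(torch.cat([f, qp], dim=1))  # Q-values for [keep, split]

net = CUSplitQNet()
q = net(torch.rand(1, 1, 64, 64), torch.tensor([[32.0]]))
split = q.argmax(dim=-1)    # take the action with the larger predicted RD-cost reduction
```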

Deep Video Prediction
Deep Video Prediction Through Sparse Motion Regularization
Yung-Han Ho, Chih Chun Chan, Wen-Hsiao Peng
IEEE International Conference on Image Processing (ICIP), Oct. 2020.
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction on a sparse motion field composed of few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and their motion estimation are addressed by two neural networks trained under a reinforcement learning setting. Our model achieves the state-of-the-art performance on the CaltechPed, UCF101 and CIF datasets in one-step and multi-step prediction tests. It shows good generalization results and is able to learn well on small training data.
SME-Net: Sparse Motion Estimation for Parametric Video Prediction through Reinforcement Learning
Yung-Han Ho, Chuan-Yuan Cho, Wen-Hsiao Peng, Guo-Lun Jin
IEEE International Conference on Computer Vision (ICCV), Oct. 2019.
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction on a sparse motion field composed of few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and their motion estimation are addressed by two neural networks trained under a reinforcement learning setting. Our model achieves the state-of-the-art performance on the CaltechPed, UCF101 and CIF datasets in one-step and multi-step prediction tests. It shows good generalization results and is able to learn well on small training data.

Domain Adaptation for Semantic Segmentation
All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation
Wei-Lun Chang, Hui-Po Wang, Wen-Hsiao Peng, Wei-Chen Chiu
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
In this paper we tackle the problem of unsupervised domain adaptation for the task of semantic segmentation, where we attempt to transfer the knowledge learned upon synthetic datasets with ground-truth labels to real-world images without any annotation. With the hypothesis that the structural content of images is the most informative and decisive factor to semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations, which can further realize image translation across domains and enable label transfer to improve segmentation performance. Extensive experiments verify the effectiveness of our proposed DISE model and demonstrate its superiority over several state-of-the-art approaches.

Image / Video Semantic Segmentation
GSVNet: Guided Spatially-Varying Convolution for Fast Semantic Segmentation on Video
Shih-Po Lee, Si-Cun Chen, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2021.
This paper addresses fast semantic segmentation on video. Video segmentation often calls for real-time, or even faster than real-time, processing. One common recipe for conserving the computation arising from feature extraction is to propagate the features of a few selected keyframes. However, recent advances in fast image segmentation make these solutions less attractive. To leverage fast image segmentation for furthering video segmentation, we propose a simple yet efficient propagation framework. Specifically, we perform lightweight flow estimation in 1/8-downscaled image space for temporal warping in the segmentation output space. Moreover, we introduce a guided spatially-varying convolution for fusing segmentations derived from the previous and current frames, to mitigate propagation error and enable lightweight feature extraction on non-keyframes. Experimental results on Cityscapes and CamVid show that our scheme achieves the state-of-the-art accuracy-throughput trade-off on video segmentation.
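The temporal-warping step can be sketched with a standard backward-warping routine, as below; the flow scale handling, normalization, and padding choices are assumptions, and the guided spatially-varying fusion is not reproduced.

```python
import torch
import torch.nn.functional as F

def warp_segmentation(prev_seg, flow):
    """Backward-warp the previous segmentation scores with a dense flow field (in pixels)."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow   # sampling positions
    grid = grid.permute(0, 2, 3, 1)                                   # (B, H, W, 2)
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1                     # normalize x to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1                     # normalize y to [-1, 1]
    return F.grid_sample(prev_seg, grid, align_corners=True)

# 19-class Cityscapes-style scores warped with a (here all-zero) flow field.
seg = warp_segmentation(torch.rand(1, 19, 64, 128), torch.zeros(1, 2, 64, 128))
```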
Weakly-Supervised Image Semantic Segmentation Using Graph Convolutional Networks
Shun-Yi Pan*, Cheng-You Lu*, Shih-Po Lee, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2021.
This work addresses weakly-supervised image semantic segmentation based on image-level class labels. One common approach to this task is to propagate the activation scores of Class Activation Maps (CAMs) using a random-walk mechanism in order to arrive at complete pseudo labels for training a semantic segmentation network in a fully-supervised manner. However, the feed-forward nature of the random walk imposes no regularization on the quality of the resulting complete pseudo labels. To overcome this issue, we propose a Graph Convolutional Network (GCN)-based feature propagation framework. We formulate the generation of complete pseudo labels as a semi-supervised learning task and learn a 2-layer GCN separately for every training image by back-propagating a Laplacian and an entropy regularization loss. Experimental results on the PASCAL VOC 2012 dataset confirm the superiority of our scheme to several state-of-the-art baselines.
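A compact, illustrative 2-layer GCN for propagating CAM seeds on a per-image affinity graph is sketched below. The graph construction, the exact Laplacian regularization, and the training schedule of the paper are not reproduced; the cross-entropy-plus-entropy loss and all names here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """A 2-layer GCN over an (N, N) affinity graph of image regions."""
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, adj):
        a_hat = adj + torch.eye(adj.size(0))                          # add self-loops
        d = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_norm = d[:, None] * a_hat * d[None, :]                      # D^-1/2 (A + I) D^-1/2
        h = F.relu(a_norm @ self.w1(x))
        return a_norm @ self.w2(h)                                    # per-node class logits

def propagation_loss(logits, seed_labels, seed_mask):
    """Cross-entropy on confident CAM seeds plus entropy minimization elsewhere."""
    ce = F.cross_entropy(logits[seed_mask], seed_labels[seed_mask])
    p = F.softmax(logits[~seed_mask], dim=-1)
    entropy = -(p * p.clamp(min=1e-8).log()).sum(dim=-1).mean()
    return ce + 0.1 * entropy                                         # weight is a placeholder

gcn = TwoLayerGCN(in_dim=16, hidden_dim=32, num_classes=3)
x, adj = torch.rand(6, 16), torch.rand(6, 6)
loss = propagation_loss(gcn(x, adj),
                        torch.tensor([0, 1, 2, 0, 1, 2]),
                        torch.tensor([1, 1, 1, 0, 0, 0], dtype=torch.bool))
```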
Semantic Segmentation on Compressed Video Using Block Motion Compensation and Guided Inpainting
Stefanie Tanujaya, Tieh Chu, Jia-Hao Liu, Wen-Hsiao Peng
IEEE International Symposium on Circuits and Systems (ISCAS), Spain, Oct 2020.
This paper addresses the problem of fast semantic segmentation on compressed video. Unlike most prior works for video segmentation, which perform feature propagation based on optical flow estimates or sophisticated warping techniques, ours takes advantage of block motion vectors in the compressed bitstream to propagate the segmentation of a keyframe to subsequent non-keyframes. This approach, however, needs to respect the inter-frame prediction structure, which often suggests recursive, multi-step prediction with error propagation and accumulation in the temporal dimension. To tackle the issue, we refine the motion-compensated segmentation using inpainting. Our inpainting network incorporates guided non-local attention for long-range reference and pixel-adaptive convolution for ensuring the local coherence of the segmentation. A fusion step then follows to combine both the motion-compensated and inpainted segmentations. Experimental results show that our method outperforms the state-of-the-art baselines in terms of segmentation accuracy. Moreover, it introduces the least amount of network parameters and multiply-add operations for non-keyframe segmentation.

Human Pose Estimation Using Radar
Human Pose Estimation Using Millimeter Wave Radar
Shih-Po Lee, Niraj Prakash Kini, Wen-Hsiao Peng, Ching-Wen Ma, Jenq-Neng Hwang
IEEE Winter Conference on Applications of Computer Vision (WACV), Jan. 2023.
This paper introduces a novel human pose estimation benchmark, Human Pose with Millimeter Wave Radar (HuPR), that includes synchronized vision and radio signal components. This dataset is created using cross-calibrated mmWave radar sensors and a monocular RGB camera for cross-modality training of radar-based human pose estimation. In addition to the benchmark, we propose a cross-modality training framework that leverages the ground-truth 2D keypoints representing human body joints for training, which are systematically generated from the pre-trained 2D pose estimation network based on a monocular camera input image, avoiding laborious manual label annotation efforts. Our intensive experiments on the HuPR benchmark show that the proposed scheme achieves better human pose estimation performance with only radar data, as compared to traditional pre-processing solutions and previous radio-frequency-based methods.

Video Synthesis
MoTIF: Learning Motion Trajectories with Local Implicit Neural Functions for Continuous Space-Time Video Super-Resolution
Yi-Hsin Chen*, Si-Cun Chen*, Yi-Hsin Chen, Yen-Yu Lin, Wen-Hsiao Peng
IEEE International Conference on Computer Vision (ICCV), Oct. 2023.
This work addresses continuous space-time video super-resolution (C-STVSR) that aims to up-scale an input video both spatially and temporally by any scaling factors. One key challenge of C-STVSR is to propagate information temporally among the input video frames. To this end, we introduce a space-time local implicit neural function. It has the striking feature of learning forward motion for a continuum of pixels. We motivate the use of forward motion from the perspective of learning individual motion trajectories, as opposed to learning a mixture of motion trajectories with backward motion. To ease motion interpolation, we encode sparsely sampled forward motion extracted from the input video as the contextual input. Along with a reliability-aware splatting and decoding scheme, our framework, termed MoTIF, achieves the state-of-the-art performance on C-STVSR.
Video Rescaling Networks with Joint Optimization Strategies for Downscaling and Upscaling
Yan-Cheng Huang*, Yi-Hsin Chen*, Cheng-You Lu, Hui-Po Wang, Wen-Hsiao Peng and Ching-Chun Huang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
This paper addresses the video rescaling task, which arises from the need to adapt the video spatial resolution to suit individual viewing devices. We aim to jointly optimize video downscaling and upscaling as a combined task. Most recent studies focus on image-based solutions, which do not consider temporal information. We present two joint optimization approaches based on invertible neural networks with coupling layers. Our Long Short-Term Memory Video Rescaling Network (LSTM-VRN) leverages temporal information in the low-resolution video to form an explicit prediction of the missing high-frequency information for upscaling. Our Multi-input Multi-output Video Rescaling Network (MIMO-VRN) proposes a new strategy for downscaling and upscaling a group of video frames simultaneously. Not only do they outperform the image-based invertible model in terms of quantitative and qualitative results, but they also show much improved upscaling quality compared with video rescaling methods without joint optimization. To our best knowledge, this work is the first attempt at the joint optimization of video downscaling and upscaling.
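The invertible coupling layers underlying such rescaling networks can be sketched as below: an additive coupling whose forward pass serves the downscaling direction and whose exact inverse serves upscaling. The split into two branches and the tiny transform nets are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """An exactly invertible additive coupling layer over two feature branches."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(channels, channels, 3, padding=1))
        self.g = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x_low, x_high):
        y_low = x_low + self.f(x_high)       # forward pass = downscaling direction
        y_high = x_high + self.g(y_low)
        return y_low, y_high

    def inverse(self, y_low, y_high):
        x_high = y_high - self.g(y_low)      # exact inversion = upscaling direction
        x_low = y_low - self.f(x_high)
        return x_low, x_high

layer = AdditiveCoupling(4)
a, b = torch.randn(1, 4, 8, 8), torch.randn(1, 4, 8, 8)
ra, rb = layer.inverse(*layer(a, b))
assert torch.allclose(ra, a, atol=1e-5) and torch.allclose(rb, b, atol=1e-5)
```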

Incremental Learning
Class-incremental Learning with Rectified Feature-Graph Preservation
Cheng-Hsun Lei*, Yi-Hsin Chen*, Wen-Hsiao Peng, Wei-Chen Chiu
Asian Conference on Computer Vision (ACCV), Japan, Nov. 2020.
In this paper, we address the problem of distillation-based class-incremental learning with a single head. A central theme of this task is to learn new classes that arrive in sequential phases over time while keeping the model's capability of recognizing seen classes with only limited memory for preserving seen data samples. Many regularization strategies have been proposed to mitigate the phenomenon of catastrophic forgetting. To understand better the essence of these regularizations, we introduce a feature-graph preservation perspective. Insights into their merits and faults motivate our weighted-Euclidean regularization for old knowledge preservation. We further propose rectified cosine normalization and show how it can work with binary cross-entropy to increase class separation for effective learning of new classes. Experimental results on both CIFAR-100 and ImageNet datasets demonstrate that our method outperforms the state-of-the-art approaches in reducing classification error, easing catastrophic forgetting, and encouraging evenly balanced accuracy over different classes.
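A sketch of a cosine-normalized classifier head trained with binary cross-entropy, in the spirit of the rectified cosine normalization described above, is given below; the exact rectification and scaling used in the paper are not reproduced, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Classifier head scoring classes by scaled cosine similarity."""
    def __init__(self, feat_dim, num_classes, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, feat):
        logits = F.linear(F.normalize(feat, dim=-1), F.normalize(self.weight, dim=-1))
        return self.scale * logits           # cosine similarity per class, scaled

clf = CosineClassifier(feat_dim=128, num_classes=10)
feat = torch.randn(4, 128)
target = F.one_hot(torch.tensor([1, 3, 5, 7]), num_classes=10).float()
loss = F.binary_cross_entropy_with_logits(clf(feat), target)   # per-class BCE, as described above
```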

Visual Question Answering
Learning Goal-oriented Visual Dialogue: Imitating and Surpassing Analytic Experts
Yen-Wei Chang, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2019.
This paper tackles the problem of learning a questioner in the goal-oriented visual dialog task. Several previous works adopt model-free reinforcement learning. Most pretrain the model from a finite set of human-generated data. We argue that using limited demonstrations to kick-start the questioner is insufficient due to the large policy search space. Inspired by a recently proposed information theoretic approach, we develop two analytic experts to serve as a source of high-quality demonstrations for imitation learning. We then take advantage of reinforcement learning to refine the model towards the goal-oriented objective. Experimental results on the GuessWhat?! dataset show that our method has the combined merits of imitation and reinforcement learning, achieving the state-of-the-art performance.

Deep Generative Model
Learning Priors for Adversarial Autoencoders
Hui-Po Wang, Wen-Hsiao Peng, Wei-Jan Ko
Asia-Pacific Signal and Information Processing Association (APSIPA), USA, Nov. 2018.
Most deep latent factor models choose simple priors for simplicity, tractability, or because it is unclear what prior to use. Recent studies show that the choice of the prior may have a profound effect on the expressiveness of the model, especially when its generative network has limited capacity. In this paper, we propose to learn a proper prior from data for adversarial autoencoders (AAEs). We introduce the notion of code generators to transform manually selected simple priors into ones that can better characterize the data distribution. Experimental results show that the proposed model can generate better image quality and learn better disentangled representations than AAEs in both supervised and unsupervised settings. Lastly, we present its ability to do cross-domain translation in a text-to-image synthesis task.

AI Drone
Learning to Fly with a Video Generator
Chia-Chun Chung, Wen-Hsiao Peng, Teng-Hu Cheng and Chia-Hau Yu
IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2021.
This paper demonstrates a model-based reinforcement learning framework for training a self-flying drone. We implement the Dreamer proposed in a prior work as an environment model that responds to the action taken by the drone by predicting the next video frame as a new state signal. The Dreamer is a conditional video sequence generator. This model-based environment avoids the time-consuming interactions between the agent and the environment, largely speeding up the training process. This demonstration showcases for the first time the application of the Dreamer to train an agent that can finish the racing task in the AirSim simulator.