Research Content

Learning-based Video Compression
Fast-OMRA: Fast Online Motion Resolution Adaptation for Neural B-Frame Coding
Sang Nguyen Quang, Zong-Lin Gao, Kuan-Wei Ho, Xiem Hoang Van, and Wen-Hsiao Peng
IEEE Latin American Symposium on Circuits and Systems (LASCAS), Feb. 2025.
Most learned B-frame codecs with hierarchical temporal prediction suffer from the domain shift issue caused by the discrepancy in the Group-of-Pictures (GOP) size used for training and test. As such, the motion estimation network may fail to predict large motion properly. One effective strategy to mitigate this domain shift issue is to downsample video frames for motion estimation. However, finding the optimal downsampling factor involves a time-consuming rate-distortion optimization process. This work introduces lightweight classifiers to determine the downsampling factor. To strike a good rate-distortion-complexity trade-off, our classifiers observe simple state signals, including only the coding and reference frames, to predict the best downsampling factor. We present two variants that adopt binary and multi-class classifiers, respectively. The binary classifier adopts the Focal Loss for training, classifying between motion estimation at high and low resolutions. Our multi-class classifier is trained with novel soft labels incorporating the knowledge of the rate-distortion costs of different downsampling factors. Both variants operate as add-on modules without the need to re-train the B-frame codec. Experimental results confirm that they achieve comparable coding performance to the brute-force search methods while greatly reducing computational complexity.
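As an illustration of the soft-label idea described above, the sketch below derives soft labels from per-factor rate-distortion costs and trains a multi-class classifier against them. The temperature, tensor shapes, and function names are assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def soft_labels_from_rd_costs(rd_costs, temperature=1.0):
    """Turn per-factor rate-distortion costs into soft labels.

    rd_costs: tensor of shape (B, K), one cost per candidate downsampling
    factor; lower cost means a better factor.
    """
    # Lower RD cost -> higher target probability (the temperature is an assumption).
    return F.softmax(-rd_costs / temperature, dim=-1)

def soft_label_loss(logits, rd_costs, temperature=1.0):
    """Cross-entropy between classifier logits and the RD-derived soft labels."""
    targets = soft_labels_from_rd_costs(rd_costs, temperature)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```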
On the Rate-Distortion-Complexity Trade-offs of Neural Video Coding
Yi-Hsin Chen, Kuan-Wei Ho, Martin Benjak, Jörn Ostermann, and Wen-Hsiao Peng
IEEE International Workshop on Multimedia Signal Processing (MMSP), Oct. 2024.
This paper aims to delve into the rate-distortion-complexity trade-offs of modern neural video coding. Recent years have witnessed much research effort being focused on exploring the full potential of neural video coding. Conditional autoencoders have emerged as the mainstream approach to efficient neural video coding. The central theme of conditional autoencoders is to leverage both spatial and temporal information for better conditional coding. However, a recent study indicates that conditional coding may suffer from information bottlenecks, potentially performing worse than traditional residual coding. To address this issue, recent conditional coding methods incorporate a large number of high-resolution features as the condition signal, leading to a considerable increase in the number of multiply-accumulate operations, memory footprint, and model size. Taking DCVC as the common code base, we investigate how the newly proposed conditional residual coding, an emerging new school of thought, and its variants may strike a better balance among rate, distortion, and complexity.
MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression
Yi-Hsin Chen, Hong-Sheng Xie, Cheng-Wei Chen, Zong-Lin Gao, Martin Benjak, Wen-Hsiao Peng, and Jörn Ostermann
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024.
Conditional coding has lately emerged as the mainstream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when the information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than that of the intra frame. Recognizing that this assumption is not always true due to dis-occlusion phenomena or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a hybrid of conditional coding and conditional residual coding in a pixel adaptive manner. We introduce a Transformer-based conditional autoencoder. Several strategies are investigated with regard to how to condition a Transformer-based autoencoder for interframe coding, a topic that is largely under-explored. Additionally, we propose a channel transform module (CTM) to decorrelate the image latents along the channel dimension, with the aim of using the simple hyperprior to approach similar compression performance to the channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) to both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT shows comparable BD-rate results to VTM-17.0 under the low delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.
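The pixel-adaptive blend between conditional and conditional residual coding can be summarized with the sketch below; `encode_fn` and `decode_fn` are placeholders for the codec's conditional transforms, not the actual MaskCRT modules.

```python
def masked_conditional_residual(x_t, x_c, mask, encode_fn, decode_fn):
    """Pixel-adaptive hybrid of conditional and conditional residual coding.

    x_t: current frame, x_c: temporal prediction (the condition), mask: learned
    soft mask in [0, 1] with the same spatial size as the frames.
    encode_fn/decode_fn stand in for the conditional analysis/synthesis transforms.
    """
    # mask -> 1: code the residual x_t - x_c (conditional residual coding)
    # mask -> 0: code x_t directly, conditioned on x_c (conditional coding)
    coder_input = x_t - mask * x_c
    latent = encode_fn(coder_input, x_c)          # the condition is also fed to the codec
    x_hat = decode_fn(latent, x_c) + mask * x_c   # add the masked prediction back
    return x_hat
```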
OMRA: Online Motion Resolution Adaptation to Remedy Domain Shift in Learned Hierarchical B-frame Coding
Zong-Lin Gao, Sang Nguyen Quang, Wen-Hsiao Peng, and Xiem Hoang Van
IEEE International Conference on Image Processing (ICIP), Oct. 2024.
Learned hierarchical B-frame coding aims to leverage bidirectional reference frames for better coding efficiency. However, the domain shift between training and test scenarios due to dataset limitations poses a challenge. This issue arises from training the codec with small groups of pictures (GOP) but testing it on large GOPs. Specifically, the motion estimation network, when trained on small GOPs, is unable to handle large motion at test time, incurring a negative impact on compression performance. To mitigate the domain shift, we present an online motion resolution adaptation (OMRA) method. It adapts the spatial resolution of video frames on a per-frame basis to suit the capability of the motion estimation network in a pre-trained B-frame codec. Our OMRA is an online, inference technique. It need not re-train the codec and is readily applicable to existing B-frame codecs that adopt hierarchical bi-directional prediction. Experimental results show that OMRA significantly enhances the compression performance of two state-of-the-art learned B-frame codecs on commonly used datasets.
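A minimal sketch of a brute-force per-frame search over motion-estimation resolutions, in the spirit of OMRA; the candidate factors, the `encode_bframe` interface, and the Lagrange multiplier are illustrative assumptions.

```python
def select_motion_resolution(frame, refs, codec, factors=(1, 2, 4, 8), lmbda=0.01):
    """Brute-force per-frame search over motion-estimation downsampling factors.

    codec.encode_bframe is a placeholder returning (bits, distortion) when
    motion estimation runs at 1/factor resolution.
    """
    best_factor, best_cost = None, float("inf")
    for f in factors:
        bits, distortion = codec.encode_bframe(frame, refs, motion_downsample=f)
        cost = bits + lmbda * distortion          # RD cost J = R + lambda * D
        if cost < best_cost:
            best_factor, best_cost = f, cost
    return best_factor
```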
Conditional Variational Autoencoders for Hierarchical B-Frame Coding
Zong-Lin Gao, Cheng-Wei Chen, Yi-Chen Yao, Cheng-Yuan Ho, and Wen-Hsiao Peng
IEEE International Symposium on Circuits and Systems (ISCAS), May 2024.
In response to the Grand Challenge on Neural Network-based Video Coding at ISCAS 2024, this paper proposes a learned hierarchical B-frame coding scheme. Most learned video codecs concentrate on P-frame coding for the RGB content, while B-frame coding for the YUV420 content remains largely under-explored. Some early works explore Conditional Augmented Normalizing Flows (CANF) for B-frame coding. However, they suffer from high computational complexity because of stacking multiple variational autoencoders (VAE) and using separate Y and UV codecs. This work aims to develop a lightweight VAE-based B-frame codec in a conditional coding framework. It features (1) extracting multi-scale features for conditional motion and inter-frame coding, (2) performing frame-type adaptive coding for better bit allocation, and (3) a lightweight conditional VAE backbone that encodes YUV420 content by a simple conversion into YUV444 content for joint Y and UV coding. Experimental results confirm its superior compression performance to the CANF-based B-frame codec from last year's challenge while having much reduced complexity.
Rate Adaptation for Learned Two-layer B-frame Coding without Signaling Motion Information
Hong-Sheng Xie, Yi-Hsin Chen, Wen-Hsiao Peng, Martin Benjak, and Jörn Ostermann
IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2023.
This paper explores the potential of a learned two-layer B-frame codec, known as TLZMC. TLZMC is one of the few early attempts that deviate from the hybrid-based coding architecture by skipping motion coding. With TLZMC, a low-resolution base layer is utilized to encode temporally unpredictable information. We address the question of whether adapting the base-layer bitrate can achieve better rate-distortion performance. We apply the feature map modulation technique to enable per-frame bitrate adaptation of the base layer. We then propose and compare three online search strategies for determining the base-layer rate parameter: per-level brute-force search, per-level greedy search, and per-frame greedy search. Experimental results show that our top-performing search strategy achieves 0.6%-15.8% Bjøntegaard-Delta rate savings over TLZMC.
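A sketch of one possible form of the per-frame greedy search, assuming the candidates are ordered and the search stops once the rate-distortion cost stops improving; the `codec.encode` interface and the lambda value are placeholders, not the paper's exact procedure.

```python
def per_frame_greedy_search(frame, refs, codec, candidates, lmbda=0.01):
    """Greedy per-frame selection of the base-layer rate parameter.

    Walks through an ordered candidate list and stops once the rate-distortion
    cost no longer improves. codec.encode is a placeholder returning
    (bits, distortion) for a given base-layer rate parameter.
    """
    best_param, best_cost = None, float("inf")
    for param in candidates:
        bits, distortion = codec.encode(frame, refs, base_rate_param=param)
        cost = bits + lmbda * distortion
        if cost >= best_cost:      # early exit: the cost stopped improving
            break
        best_param, best_cost = param, cost
    return best_param
```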
Learning-Based Scalable Video Coding with Spatial and Temporal Prediction
Martin Benjak, Yi-Hsin Chen, Wen-Hsiao Peng, and Jörn Ostermann
IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2023.
In this work, we propose a hybrid learning-based method for layered spatial scalability. Our framework consists of a base layer (BL), which encodes a spatially downsampled representation of the input video using Versatile Video Coding (VVC), and a learning-based enhancement layer (EL), which conditionally encodes the original video signal. The EL is conditioned by two fused prediction signals: a spatial inter-layer prediction signal, which is generated by spatially upsampling the output of the BL using super-resolution, and a temporal inter-frame prediction signal, which is generated by decoder-side motion compensation without signaling any motion vectors. We show that our method outperforms LCEVC and has comparable performance to full-resolution VVC for high-resolution content, while still offering scalability.
Hierarchical B-frame Video Compression Using Two-layer CANF without Motion Coding
David Alexandre, Hsueh-Ming Hang, Wen-Hsiao Peng
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
Typical video compression systems consist of two main modules: motion coding and residual coding. This general architecture is adopted by classical coding schemes (such as international standards H.265 and H.266) and deep learning-based coding schemes. We propose a novel B-frame coding architecture based on two-layer Conditional Augmented Normalizing Flows (CANF). It has the striking feature of not transmitting any motion information. Our proposed idea of video compression without motion coding offers a new direction for learned video coding. Our base layer is a low-resolution image compressor that replaces the full-resolution motion compressor. The low-resolution coded image is merged with the warped high-resolution images to generate a high-quality image as a conditioning signal for the enhancement-layer image coding in full resolution. One advantage of this architecture is significantly reduced computational complexity due to eliminating the motion information compressor. In addition, we adopt a skip-mode coding technique to reduce the transmitted latent samples. The rate-distortion performance of our scheme is slightly lower than that of the state-of-the-art learned B-frame coding scheme, B-CANF, but outperforms other learned B-frame coding schemes. However, compared to B-CANF, our scheme saves 45% of multiply-accumulate operations (MACs) for encoding and 27% of MACs for decoding. The code is available at https://nycu-clab.github.io.
B-CANF: Adaptive B-frame Coding with Conditional Augmented Normalizing Flows
Mu-Jung Chen, Yi-Hsin Chen, and Wen-Hsiao Peng
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2023.
Over the past few years, learning-based video compression has become an active research area. However, most works focus on P-frame coding. Learned B-frame coding is under-explored and more challenging. This work introduces a novel B-frame coding framework, termed B-CANF, that exploits conditional augmented normalizing flows for B-frame coding. B-CANF additionally features two novel elements: frame-type adaptive coding and B*-frames. Our frame-type adaptive coding learns better bit allocation for hierarchical B-frame coding by dynamically adapting the feature distributions according to the B-frame type. Our B*-frames allow greater flexibility in specifying the group-of-pictures (GOP) structure by reusing the B-frame codec to mimic P-frame coding, without the need for an additional, separate P-frame codec. On commonly used datasets, B-CANF achieves the state-of-the-art compression performance as compared to the other learned B-frame codecs.
Learned Hierarchical B-frame Coding with Adaptive Feature Modulation for YUV 4:2:0 Content
Mu-Jung Chen, Hong-Sheng Xie, Cheng Chien, Wen-Hsiao Peng, and Hsueh-Ming Hang
IEEE International Symposium on Circuits and Systems (ISCAS), May 2023.
This paper introduces a learned hierarchical B-frame coding scheme in response to the Grand Challenge on Neural Network-based Video Coding at ISCAS 2023. We address specifically three issues, including (1) B-frame coding, (2) YUV 4:2:0 coding, and (3) content-adaptive variable-rate coding with only one single model. Most learned video codecs operate internally in the RGB domain for P-frame coding. B-frame coding for YUV 4:2:0 content is largely under-explored. In addition, while there have been prior works on variable-rate coding with conditional convolution, most of them fail to consider the content information. We build our scheme on conditional augmented normalizing flows (CANF). It features conditional motion and inter-frame codecs for efficient B-frame coding. To cope with YUV 4:2:0 content, two conditional inter-frame codecs are used to process the Y and UV components separately, with the coding of the UV components conditioned additionally on the Y component. Moreover, we introduce adaptive feature modulation in every convolutional layer, taking into account both the content information and the coding levels of B-frames to achieve content-adaptive variable-rate coding. Experimental results show that our model outperforms x265 and the winner of last year's challenge on commonly used datasets in terms of PSNR-YUV.
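A minimal sketch of what an adaptive feature modulation layer could look like: the condition (content information plus the B-frame coding level) is mapped to per-channel scale and shift factors. The layer sizes and the exponential parameterization of the scale are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureModulation(nn.Module):
    """Channel-wise affine modulation conditioned on content and coding level."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.to_scale = nn.Linear(cond_dim, channels)
        self.to_shift = nn.Linear(cond_dim, channels)

    def forward(self, feat, cond):
        # feat: (B, C, H, W), cond: (B, cond_dim)
        scale = self.to_scale(cond).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(cond).unsqueeze(-1).unsqueeze(-1)
        return feat * torch.exp(scale) + shift   # exp keeps the scaling positive
```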
Content-adaptive Motion Rate Adaption for Learned Video Compression
Chih-Hsuan Lin, Yi-Hsin Chen, and Wen-Hsiao Peng
Picture Coding Symposium (PCS), December 2022.
This paper introduces an online motion rate adaptation scheme for learned video compression, with the aim of achieving content-adaptive coding on individual test sequences to mitigate the domain gap between training and test data. It features a patch-level bit allocation map, termed the α-map, to trade off between the bit rates for motion and inter-frame coding in a spatially-adaptive manner. We optimize the α-map through an online back-propagation scheme at inference time. Moreover, we incorporate a look-ahead mechanism to consider its impact on future frames. Extensive experimental results confirm that the proposed scheme, when integrated into a conditional learned video codec, is able to adapt motion bit rate effectively, showing much improved rate-distortion performance particularly on test sequences with complicated motion characteristics.
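The online α-map optimization can be pictured as a small inference-time optimization loop; `codec.encode_with_alpha`, the patch size, and the hyper-parameters below are assumed placeholders rather than the actual implementation.

```python
import torch

def optimize_alpha_map(codec, frames, steps=50, lr=1e-2, lmbda=0.01):
    """Inference-time optimization of a patch-level alpha-map by back-propagation.

    codec.encode_with_alpha is a placeholder returning differentiable rate and
    distortion estimates for the current and look-ahead frames given the map.
    """
    h, w = frames[0].shape[-2] // 16, frames[0].shape[-1] // 16   # one value per 16x16 patch (assumed)
    alpha = torch.zeros(1, 1, h, w, requires_grad=True)
    optimizer = torch.optim.Adam([alpha], lr=lr)
    for _ in range(steps):
        rate, distortion = codec.encode_with_alpha(frames, torch.sigmoid(alpha))
        loss = rate + lmbda * distortion       # rate-distortion objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return torch.sigmoid(alpha).detach()       # final bit-allocation map in [0, 1]
```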
CANF-VC: Conditional Augmented Normalizing Flows for Video Compression
Yung-Han Ho, Chih-Peng Chang, Peng-Yu Chen, Alessandro Gnutti, and Wen-Hsiao Peng
European Conference on Computer Vision (ECCV), Oct. 2022.
This paper presents an end-to-end learning-based video compression system, termed CANF-VC, based on conditional augmented normalizing flows (ANF). Most learned video compression systems adopt the same hybrid-based coding architecture as the traditional codecs. Recent research on conditional coding has shown the sub-optimality of the hybrid-based coding and opens up opportunities for deep generative models to take a key role in creating new coding frameworks. CANF-VC represents a new attempt that leverages the conditional ANF to learn a video generative model for conditional inter-frame coding. We choose ANF because it is a special type of generative model, which includes variational autoencoder as a special case and is able to achieve better expressiveness. CANF-VC also extends the idea of conditional coding to motion coding, forming a purely conditional coding framework. Extensive experimental results on commonly used datasets confirm the superiority of CANF-VC to the state-of-the-art methods.
Learned Video Compression for YUV 4:2:0 Content Using Flow-Based Conditional Inter-Frame Coding
Yung-Han Ho, Chih-Hsuan Lin, Peng-Yu Chen, Mu-Jung Chen, Chih-Peng Chang, Wen-Hsiao Peng
IEEE International Symposium on Circuits and Systems (ISCAS), May 2022.
This paper proposes a learning-based video compression framework that applies a conditional flow-based model for inter-frame coding and takes YUV 4:2:0 as the input format. Most learning-based video compression models use predictive coding and directly encode the residual signal, which is considered a sub-optimal solution. In addition, those models usually only operate on RGB, which is also regarded as an inefficient format. Furthermore, they require multiple models to fit on different bit rates. To solve these issues, we introduce a conditional flow-based video compression framework to improve the coding efficiency. To adapt to the YUV 4:2:0 format, we incorporate lossless space-to-depth and depth-to-space transformations in our design. Lastly, we apply a rate-adaptation network to both the I-frame and P-frame coders to achieve variable-rate coding, which can further be extended to rate control applications. Our experimental results show comparable or better performance against x265 on the UVG and MCL-JCV common test datasets in terms of PSNR-YUV.
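The lossless space-to-depth handling of YUV 4:2:0 input can be sketched as follows: a 2x2 space-to-depth on the luma plane yields four half-resolution channels that align with the chroma planes, and the inverse transform restores the original planes. This is a generic packing recipe under assumed tensor layouts, not the exact codec implementation.

```python
import torch
import torch.nn.functional as F

def yuv420_to_tensor(y, u, v):
    """Pack YUV 4:2:0 planes into one tensor via space-to-depth on the luma.

    y: (B, 1, H, W); u, v: (B, 1, H/2, W/2). The 2x2 space-to-depth on Y gives
    four half-resolution luma channels that align spatially with U and V.
    """
    y_packed = F.pixel_unshuffle(y, 2)          # (B, 4, H/2, W/2), lossless
    return torch.cat([y_packed, u, v], dim=1)   # (B, 6, H/2, W/2)

def tensor_to_yuv420(x):
    """Inverse transform: split the channels and run depth-to-space on the luma."""
    y_packed, u, v = x[:, :4], x[:, 4:5], x[:, 5:6]
    return F.pixel_shuffle(y_packed, 2), u, v
```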
P-frame Coding Proposal by NCTU: Parametric Video Prediction through Backprop-based Motion Estimation
Yung-Han Ho, Chih-Chun Chan, David Alexandre, Wen-Hsiao Peng, Chih-Peng Chang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2020.
This paper presents a parametric video prediction scheme with backprop-based motion estimation, in response to the CLIC challenge on P-frame compression. Recognizing that most learning-based video codecs rely on optical flow-based temporal prediction and suffer from having to signal a large amount of motion information, we propose to perform parametric overlapped block motion compensation on a sparse motion field. In forming this sparse motion field, we conduct the steepest descent algorithm on a loss function for identifying critical pixels, of which the motion vectors are communicated to the decoder. Moreover, we introduce a critical pixel dropout mechanism to strike a good balance between motion overhead and prediction quality. Compression results with HEVC-based residual coding on CLIC validation sequences show that our parametric video prediction achieves higher PSNR and MS-SSIM than optical flow-based warping. Moreover, our critical pixel dropout mechanism is found beneficial in terms of rate-distortion performance. Our scheme offers the potential for working with learned residual coding.

Learning-based Image Compression
LiDAR Depth Map Guided Image Compression Model
Alessandro Gnutti, Stefano Della Fiore, Mattia Savardi, Yi-Hsin Chen, Riccardo Leonardi, Wen-Hsiao Peng
IEEE International Conference on Image Processing (ICIP), Oct. 2024.
The incorporation of LiDAR technology into some high-end smartphones has unlocked numerous possibilities across various applications, including photography, image restoration, augmented reality, and more. In this paper, we introduce a novel direction that harnesses LiDAR depth maps to enhance the compression of the corresponding RGB camera images. To the best of our knowledge, this represents the initial exploration in this particular research direction. Specifically, we propose a Transformer-based learned image compression system capable of achieving variable-rate compression using a single model while utilizing the LiDAR depth map as supplementary information for both the encoding and decoding processes. Experimental results demonstrate that integrating LiDAR yields an average PSNR gain of 0.83 dB and an average bitrate reduction of 16% as compared to its absence.
Transformer-based Learned Image Compression for Joint Decoding and Denoising
Yi-Hsin Chen, Kuan-Wei Ho, Shiau-Rung Tsai, Guan-Hsun Lin, Alessandro Gnutti, Wen-Hsiao Peng, Riccardo Leonardi
Picture Coding Symposium (PCS), June 2024.
This work introduces a Transformer-based image compression system. It has the flexibility to switch between the standard image reconstruction and the denoising reconstruction from a single compressed bitstream. Instead of training separate decoders for these tasks, we incorporate two add-on modules to adapt a pre-trained image decoder from performing the standard image reconstruction to joint decoding and denoising. Our scheme adopts a two-pronged approach. It features a latent refinement module to refine the latent representation of a noisy input image for reconstructing a noise-free image. Additionally, it incorporates an instance-specific prompt generator that adapts the decoding process to improve on the latent refinement. Experimental results show that our method achieves a similar level of denoising quality to training a separate decoder for joint decoding and denoising at the expense of only a modest increase in the decoder’s model size and computational complexity.
Learning-Based Conditional Image Compression
Tianma Shen, Wen-Hsiao Peng, Huang-Chia Shih, Ying Liu
IEEE International Symposium on Circuits and Systems (ISCAS), May 2024.
In recent years, deep learning-based image compression has achieved significant success. Most schemes adopt an end-to-end trained compression network with a specifically designed entropy model. Inspired by recent advances in conditional video coding, in this work, we propose a novel transformer-based conditional coding paradigm for learned image compression. Our approach first compresses a low-resolution version of the target image and up-scales the decoded image using an off-the-shelf super-resolution model. The super-resolved image then serves as the condition to compress and decompress the target high-resolution image. Experiments demonstrate the superior rate-distortion performance of our approach compared to existing methods.
Transformer-based Image Compression with Variable Image Quality Objectives
Chia-Hao Kao, Yi-Hsin Chen, Cheng Chien, Wei-Chen Chiu, and Wen-Hsiao Peng
Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference, Oct. 2023.
This paper presents a Transformer-based image compression system that allows for a variable image quality objective according to the user's preference. Optimizing a learned codec for different quality objectives leads to reconstructed images with varying visual characteristics. Our method provides the user with the flexibility to choose a trade-off between two image quality objectives using a single, shared model. Motivated by the success of prompt-tuning techniques, we introduce prompt tokens to condition our Transformer-based autoencoder. These prompt tokens are generated adaptively based on the user's preference and input image through learning a prompt generation network. Extensive experiments on commonly used quality metrics demonstrate the effectiveness of our method in adapting the encoding and/or decoding processes to a variable quality objective. While offering the additional flexibility, our proposed method performs comparably to the single-objective methods in terms of rate-distortion performance.
Transformer-based Variable-rate Image Compression With Region-of-interest Control
Chia-Hao Kao, Ying-Chieh Weng, Yi-Hsin Chen, Wei-Chen Chiu, Wen-Hsiao Peng
IEEE International Conference on Image Processing (ICIP), Oct. 2023.
This paper proposes a transformer-based learned image compression system. It is capable of achieving variable-rate compression with a single model while supporting the region-of-interest (ROI) functionality. Inspired by prompt tuning, we introduce prompt generation networks to condition the transformer-based autoencoder of compression. Our prompt generation networks generate content-adaptive tokens according to the input image, an ROI mask, and a rate parameter. The separation of the ROI mask and the rate parameter allows an intuitive way to achieve variable-rate and ROI coding simultaneously. Extensive experiments validate the effectiveness of our proposed method and confirm its superiority over the other competing methods.
ANFIC: Image Compression Using Augmented Normalizing Flows
Yung-Han Ho, Chih-Chun Chan, Wen-Hsiao Peng, Hsueh-Ming Hang, Marek Domanski
IEEE Open Journal of Circuits and Systems, Dec. 2021.
This paper introduces an end-to-end learned image compression system, termed ANFIC, based on Augmented Normalizing Flows (ANF). ANF is a new type of flow model, which stacks multiple variational autoencoders (VAE) for greater model expressiveness. VAE-based image compression has gone mainstream, showing promising compression performance. Our work presents the first attempt to leverage VAE-based compression in a flow-based framework. ANFIC further advances compression efficiency by hierarchically stacking and extending multiple VAEs. The invertibility of ANF, together with our training strategies, enables ANFIC to support a wide range of quality levels without changing the encoding and decoding networks. Extensive experimental results show that in terms of PSNR-RGB, ANFIC performs comparably to or better than the state-of-the-art learned image compression. Moreover, it performs close to VVC intra coding, from low-rate compression up to perceptually lossless compression. In particular, ANFIC achieves the state-of-the-art performance, when extended with conditional convolution for variable rate compression with a single model. The source code of ANFIC can be found at https://github.com/dororojames/ANFIC.
End-to-End Learned Image Compression with Augmented Normalizing Flows
Yung-Han Ho, Chih-Chun Chan, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2021.
This paper presents a new attempt at using augmented normalizing flows (ANF) for lossy image compression. ANF is a specific type of normalizing flow model that augments the input with independent noise, allowing a smoother transformation from the augmented input space to the latent space. Inspired by the fact that ANF can offer greater expressivity by stacking multiple variational autoencoders (VAE), we generalize the popular VAE-based compression framework by the autoencoding transforms of ANF. When evaluated on the Kodak dataset, our ANF-based model provides 3.4% higher BD-rate saving as compared with a VAE-based baseline that implements hyper-prior with mean prediction. Interestingly, it benefits even more from the incorporation of a post-processing network, showing 11.8% rate saving as compared to 6.0% with the baseline plus post-processing.
A Hybrid Layered Image Compressor with Deep-Learning Technique
Wei-Cheng Lee, Chih-Peng Chang, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE International Workshop on Multimedia Signal Processing (MMSP), Sep. 2020.
The proposed compression system features a VVC intra codec as the base layer and a learning-based residual codec as the enhancement layer. The latter aims to refine the quality of the base layer via sending a latent residual signal. In particular, a base-layer-guided attention module is employed to focus the residual extraction on critical high-frequency areas. To reconstruct the image, this latent residual signal is combined with the base-layer output in a non-linear fashion by a neural-network-based synthesizer. The proposed method shows comparable rate-distortion performance to single-layer VVC intra in terms of common objective metrics, but presents better subjective quality particularly at high compression ratios in some cases. It consistently outperforms HEVC intra, JPEG 2000, and JPEG. The proposed system incurs 18M network parameters in 16-bit floating-point format. On average, the encoding of an image on Intel Xeon Gold 6154 takes about 13.5 minutes, with the VVC base layer dominating the encoding runtime. On the contrary, the decoding is dominated by the residual decoder and the synthesizer, requiring 31 seconds per image.
Learned Image Compression With Soft Bit-based Rate-distortion Optimization
David Alexandre, Chih-Peng Chang, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE International Conference on Image Processing (ICIP), Oct. 2019.
This paper introduces the notion of soft bits to address the rate-distortion optimization for learning-based image compression. Recent methods for such compression train an autoencoder end-to-end with an objective to strike a balance between distortion and rate. They are faced with the zero gradient issue due to quantization and the difficulty of estimating the rate accurately. Inspired by soft quantization, we represent quantization indices of feature maps with differentiable soft bits. This allows us to couple tightly the rate estimation with context-adaptive binary arithmetic coding. It also provides a differentiable distortion objective function. Experimental results show that our approach achieves the state-of-the-art compression performance among the learning-based schemes in terms of MS-SSIM and PSNR.
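For intuition, the sketch below shows a generic soft-quantization relaxation in the spirit of the soft bits described above (not the paper's exact binary formulation): features are softly assigned to quantizer levels, giving both a differentiable reconstruction and an entropy-style rate proxy.

```python
import torch.nn.functional as F

def soft_quantize(x, levels, temperature=1.0):
    """Differentiable soft assignment of features to quantization levels.

    x: feature tensor of any shape; levels: (L,) quantizer codebook. Returns a
    differentiable reconstruction and an entropy-style rate proxy.
    """
    dist = (x.unsqueeze(-1) - levels) ** 2                # squared distance to each level
    probs = F.softmax(-dist / temperature, dim=-1)        # soft (differentiable) assignment
    x_soft = (probs * levels).sum(dim=-1)                 # soft-quantized features
    rate = -(probs * probs.clamp_min(1e-9).log2()).sum()  # rate estimate from assignment entropy
    return x_soft, rate
```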
An Autoencoder-based Image Compressor with Principle Component Analysis and Soft-Bit Rate Estimation
Chih-Peng Chang, David Alexandre, Wen-Hsiao Peng, Hsueh-Ming Hang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2019.
We propose a lossy image compression system using the deep-learning autoencoder structure to participate in the Challenge on Learned Image Compression (CLIC) 2018. Our autoencoder uses the residual blocks with skip connections to reduce the correlation among image pixels and condense the input image into a set of feature maps, a compact representation of the original image. The bit allocation and bitrate control are implemented by using the importance maps and quantizer. The importance maps are generated by a separate neural net in the encoder. The autoencoder and the importance net are trained jointly based on minimizing a weighted sum of mean squared error, MS-SSIM, and a rate estimate. Our aim is to produce reconstructed images with good subjective quality subject to the 0.15 bits-per-pixel constraint.

Learned Image and Video Coding for Machines
TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception
Yi-Hsin Chen, Ying-Chieh Weng, Chia-Hao Kao, Cheng Chien, Wei-Chen Chiu, and Wen-Hsiao Peng
IEEE International Conference on Computer Vision (ICCV), Oct. 2023.
This work aims for transferring a Transformer-based image compression codec from human perception to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, TransTIC adopts an instance-specific prompt generator to inject instance-specific prompts to the encoder and task-specific prompts to the decoder. Extensive experiments show that our proposed method is capable of transferring the base codec to various machine tasks and outperforms the competing methods significantly. To our best knowledge, this work is the first attempt to utilize prompting on the low-level image compression task.

Video Synthesis
MoTIF: Learning Motion Trajectories with Local Implicit Neural Functions for Continuous Space-Time Video Super-Resolution
Yi-Hsin Chen*, Si-Cun Chen*, Yi-Hsin Chen, Yen-Yu Lin, Wen-Hsiao Peng
IEEE International Conference on Computer Vision (ICCV), Oct. 2023.
This work addresses continuous space-time video super-resolution (C-STVSR) that aims to up-scale an input video both spatially and temporally by any scaling factors. One key challenge of C-STVSR is to propagate information temporally among the input video frames. To this end, we introduce a space-time local implicit neural function. It has the striking feature of learning forward motion for a continuum of pixels. We motivate the use of forward motion from the perspective of learning individual motion trajectories, as opposed to learning a mixture of motion trajectories with backward motion. To ease motion interpolation, we encode sparsely sampled forward motion extracted from the input video as the contextual input. Along with a reliability-aware splatting and decoding scheme, our framework, termed MoTIF, achieves the state-of-the-art performance on C-STVSR.
Video Rescaling Networks with Joint Optimization Strategies for Downscaling and Upscaling
Yan-Cheng Huang*, Yi-Hsin Chen*, Cheng-You Lu, Hui-Po Wang, Wen-Hsiao Peng and Ching-Chun Huang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
This paper addresses the video rescaling task, which arises from the needs of adapting the video spatial resolution to suit individual viewing devices. We aim to jointly optimize video downscaling and upscaling as a combined task. Most recent studies focus on image-based solutions, which do not consider temporal information. We present two joint optimization approaches based on invertible neural networks with coupling layers. Our Long Short-Term Memory Video Rescaling Network (LSTM-VRN) leverages temporal information in the low-resolution video to form an explicit prediction of the missing high-frequency information for upscaling. Our Multi-input Multi-output Video Rescaling Network (MIMO-VRN) proposes a new strategy for downscaling and upscaling a group of video frames simultaneously. Not only do they outperform the image-based invertible model in terms of quantitative and qualitative results, but also show much improved upscaling quality than the video rescaling methods without joint optimization. To our best knowledge, this work is the first attempt at the joint optimization of video downscaling and upscaling.

Image/Video Restoration
Using Conditional Video Compressors for Image Restoration
Yi-Hsin Chen, Yen-Kuan Ho, Ting-Han Lin, Wen-Hsiao Peng, Ching-Chun Huang
International Conference on Wireless and Optical Communications (WOCC), Oct. 2024.
To address the ill-posed nature of image restoration tasks, recent research efforts have been focused on integrating conditional generative models, such as conditional variational autoencoders (CVAE). However, how to condition the autoencoder to maximize the conditional evidence lower bound remains an open issue, particularly for the restoration tasks. Inspired by the rapid advancements in CVAE-based video compression, we make the first attempt to adapt a conditional video compressor for image restoration. In doing so, we have the low-quality image to be enhanced, which plays the same role as the reference frame for conditional video coding. Our scheme applies scalar quantization in training the autoencoder, circumventing the difficulties of training a large-size codebook as with prior works that adopt vector-quantized VAE (VQ-VAE). Moreover, it trains end-to-end a fully conditioned autoencoder, including a conditional encoder, a conditional decoder, and a conditional prior network, to maximize the conditional evidence lower bound. Extensive experiments confirm the superiority of our scheme on denoising and deblurring tasks.

Human Pose Estimation Using Radar
TransHuPR: Cross-View Fusion Transformer for Human Pose Estimation Using mmWave Radar
Niraj Prakash Kini, Ruey-Horng Shiue, Ryan Chandra, Wen-Hsiao Peng, Ching-Wen Ma, Jenq-Neng Hwang
British Machine Vision Conference (BMVC), Nov. 2024.
We present TransHuPR, a novel Cross-View Fusion Transformer for mmWave radar-based 2D human pose estimation (HPE). Our work incorporates a 2D front projection view of the 3D point-cloud representation of the radar data as an input modality. The fusion transformer effectively fuses features derived from the 2D front projection views of two independent radars and delivers high-quality predictions of human pose keypoints. We also introduce a new dataset consisting of fast actions with high frame rates as continuous radar sequences. Unlike other publicly available datasets, our dataset stands out because of its size, which ensures good generalization. We also incorporate single-action and mixed-action sequences, making the dataset more challenging. We use an inexpensive multi-radar system, which can be easily replicated. Our proposed method demonstrates significant improvements over existing methods in terms of both average precision scores and qualitative analysis. The dataset and code are available at https://github.com/nirajpkini/TransHuPR
HuPR: A Benchmark for Human Pose Estimation Using Millimeter Wave Radar
Shih-Po Lee, Niraj Prakash Kini, Wen-Hsiao Peng, Ching-Wen Ma, Jenq-Neng Hwang
IEEE Winter Conference on Applications of Computer Vision (WACV), Jan. 2023.
This paper introduces a novel human pose estimation benchmark, Human Pose with Millimeter Wave Radar (HuPR), that includes synchronized vision and radio signal components. This dataset is created using cross-calibrated mmWave radar sensors and a monocular RGB camera for cross-modality training of radar-based human pose estimation. In addition to the benchmark, we propose a cross-modality training framework that leverages the ground-truth 2D keypoints representing human body joints for training, which are systematically generated from the pre-trained 2D pose estimation network based on a monocular camera input image, avoiding laborious manual label annotation efforts. Our intensive experiments on the HuPR benchmark show that the proposed scheme achieves better human pose estimation performance with only radar data, as compared to traditional pre-processing solutions and previous radio-frequency-based methods.

Reinforcement Learning for Video Encoder Control
Neural Frank-Wolfe Policy Optimization for Region-of-Interest Intra-Frame Coding with HEVC/H.265
Yung-Han Ho, Chia-Hao Kao, Wen-Hsiao Peng, Ping-Chun Hsieh
IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2022.
This paper presents a reinforcement learning (RL) framework that utilizes Frank-Wolfe policy optimization to solve Coding-Tree-Unit (CTU) bit allocation for Region-of-Interest (ROI) intra-frame coding. Most previous RL-based methods employ the single-critic design, where the rewards for distortion minimization and rate regularization are weighted by an empirically chosen hyper-parameter. Recently, the dual-critic design is proposed to update the actor by alternating the rate and distortion critics. However, its convergence is not guaranteed. To address these issues, we introduce Neural Frank-Wolfe Policy Optimization (NFWPO) in formulating the CTU-level bit allocation as an action-constrained RL problem. In this new framework, we exploit a rate critic to predict a feasible set of actions. With this feasible set, a distortion critic is invoked to update the actor to maximize the ROI-weighted image quality subject to a rate constraint. Experimental results produced with x265 confirm the superiority of the proposed method to the other baselines.
A Dual-Critic Reinforcement Learning Framework for Frame-level Bit Allocation in HEVC/H.265
Yung-Han Ho, Guo-Lun Jin, Yun Liang, Wen-Hsiao Peng, Xiao-Bo Li
Data Compression Conference (DCC), Mar. 2021.
This paper introduces a dual-critic reinforcement learning (RL) framework to address the problem of frame-level bit allocation in HEVC/H.265. The objective is to minimize the distortion of a group of pictures (GOP) under a rate constraint. Previous RL-based methods tackle such a constrained optimization problem by maximizing a single reward function that often combines a distortion and a rate reward. However, the way these rewards are combined is usually ad hoc and may not generalize well to various coding conditions and video sequences. To overcome this issue, we adapt the deep deterministic policy gradient (DDPG) reinforcement learning algorithm for use with two critics, with one learning to predict the distortion reward and the other the rate reward. In particular, the distortion critic works to update the agent when the rate constraint is satisfied. By contrast, the rate critic makes the rate constraint a priority when the agent goes over the bit budget. Experimental results on commonly used datasets show that our method outperforms the bit allocation scheme in x265 and the single-critic baseline by a significant margin in terms of rate-distortion performance while offering fairly precise rate control.
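The alternation between the two critics can be summarized with the sketch below; `q_rate`, `q_distortion`, and `rate_estimate` are placeholder callables for the learned rate critic, distortion critic, and a per-GOP rate estimate, not the paper's actual interfaces.

```python
def dual_critic_actor_loss(state, action, q_rate, q_distortion, rate_estimate, budget):
    """Pick which critic drives the actor update, following the dual-critic idea."""
    if rate_estimate(state, action) <= budget:
        # Rate constraint satisfied: the distortion critic updates the actor.
        return -q_distortion(state, action)
    # Over budget: the rate critic takes priority to pull the policy back.
    return -q_rate(state, action)
```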
Reinforcement Learning for HEVC/H.265 Intra-Frame Rate Control
Jun-Hao Hu, Wen-Hsiao Peng, Chia-Hua Chung
IEEE International Symposium on Circuits and Systems (ISCAS), Italy, May 2018.
Reinforcement learning has proven effective for solving decision making problems. However, its application to modern video codecs has yet to be seen. This paper presents an early attempt to introduce reinforcement learning to HEVC/H.265 intra-frame rate control. The task is to determine a quantization parameter value for every coding tree unit in a frame, with the objective being to minimize the frame-level distortion subject to a rate constraint. We draw an analogy between the rate control problem and the reinforcement learning problem, by considering the texture complexity of coding tree units and bit balance as the environment state, the quantization parameter value as an action that an agent needs to take, and the negative distortion of the coding tree unit as an immediate reward. We train a neural network based on Q-learning to be our agent, which observes the state to evaluate the reward for each possible action. When trained on only limited sequences, the proposed model can already perform comparably with the rate control algorithm in HM-16.15.
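A minimal sketch of the Q-learning setup described above: a small network scores every candidate QP for a coding tree unit given the state, and the standard one-step target is used for training. The state dimension, layer sizes, QP range, and discount factor are assumptions.

```python
import torch.nn as nn

class QpQNetwork(nn.Module):
    """Q-network scoring every candidate QP for a coding tree unit.

    The state (CTU texture complexity and remaining bit balance) follows the
    abstract; the layer sizes and QP range are assumptions.
    """
    def __init__(self, state_dim=2, num_qps=52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_qps),
        )

    def forward(self, state):
        return self.net(state)   # estimated reward (negative distortion) per QP

def q_learning_target(reward, next_q, done, gamma=0.99):
    """Standard one-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * next_q.max(dim=-1).values * (1.0 - done)
```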
Reinforcement Learning for HEVC/H.265 Frame-level Bit Allocation
Lian-Ching Chen, Jun-Hao Hu, Wen-Hsiao Peng
IEEE International Conference on Digital Signal Processing (DSP), China, Nov. 2018.
Frame-level bit allocation is crucial to video rate control. The problem is often cast as minimizing the distortions of a group of video frames subject to a rate constraint. When these video frames are related through inter-frame prediction, the bit allocation for different frames exhibits dependency. To address such dependency, this paper introduces reinforcement learning. We first consider frame-level texture complexity and bit balance as a state signal, define the bit allocation for each frame as an action, and compute the negative frame-level distortion as an immediate reward signal. We then train a neural network to be our agent, which observes the state to allocate bits to each frame in order to maximize cumulative reward. As compared to the rate control scheme in HM-16.15, our method shows better PSNR performance while having smaller bit rate fluctuations.
HEVC/H.265 Coding Unit Split Decision Using Deep Reinforcement Learning
Chia-Hua Chung, Wen-Hsiao Peng, Jun-Hao Hu
IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Xiamen, Nov. 2017.
The video coding community has long been seeking more effective rate-distortion optimization techniques than the widely adopted greedy approach. The difficulty arises when we need to predict how the coding mode decision made in one stage would affect subsequent decisions and thus the overall coding performance. Taking a data-driven approach, we introduce in this paper deep reinforcement learning (RL) as a mechanism for the coding unit (CU) split decision in HEVC/H.265. We propose to regard the luminance samples of a CU together with the quantization parameter as its state, the split decision as an action, and the reduction in rate-distortion cost relative to keeping the current CU intact as the immediate reward. Based on the Q-learning algorithm, we learn a convolutional neural network to approximate the rate-distortion cost reduction of each possible state-action pair. The proposed scheme performs comparably to the current full rate-distortion optimization scheme in HM-16.15, incurring a 2.5% average BD-rate loss. While also performing similarly to a conventional scheme that treats the split decision as a binary classification problem, our scheme can additionally quantify the rate-distortion cost reduction, enabling more applications.

Deep Video Prediction
Deep Video Prediction Through Sparse Motion Regularization
Yung-Han Ho, Chih-Chun Chan, Wen-Hsiao Peng
IEEE International Conference on Image Processing (ICIP), Oct. 2020.
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction on a sparse motion field composed of few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and their motion estimation are addressed by two neural networks trained under a reinforcement learning setting. Our model achieves the state-of-the-art performance on CaltechPed, UCF101 and CIF datasets in one-step and multi-step prediction tests. It shows good generalization results and is able to learn well on small training data.
SME-Net: Sparse Motion Estimation for Parametric Video Prediction through Reinforcement Learning
Yung-Han Ho, Chuan-Yuan Cho, Wen-Hsiao Peng, Guo-Lun Jin
IEEE International Conference on Computer Vision (ICCV), Oct. 2019.
This paper leverages a classic prediction technique, known as parametric overlapped block motion compensation (POBMC), in a reinforcement learning framework for video prediction. Learning-based prediction methods with explicit motion models often suffer from having to estimate large numbers of motion parameters with artificial regularization. Inspired by the success of sparse motion-based prediction for video compression, we propose a parametric video prediction on a sparse motion field composed of few critical pixels and their motion vectors. The prediction is achieved by gradually refining the estimate of a future frame in iterative, discrete steps. Along the way, the identification of critical pixels and their motion estimation are addressed by two neural networks trained under a reinforcement learning setting. Our model achieves the state-of-the-art performance on CaltechPed, UCF101 and CIF datasets in one-step and multi-step prediction tests. It shows good generalization results and is able to learn well on small training data.

Domain Adaptation for Semantic Segmentation
All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation
Wei-Lun Chang, Hui-Po Wang, Wen-Hsiao Peng, Wei-Chen Chiu
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
In this paper we tackle the problem of unsupervised domain adaptation for the task of semantic segmentation, where we attempt to transfer the knowledge learned upon synthetic datasets with ground-truth labels to real-world images without any annotation. With the hypothesis that the structural content of images is the most informative and decisive factor to semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations, which can further realize image translation across domains and enable label transfer to improve segmentation performance. Extensive experiments verify the effectiveness of our proposed DISE model and demonstrate its superiority over several state-of-the-art approaches.

Image / Video Semantic Segmentation
GSVNet: Guided Spatially-Varying Convolution for Fast Semantic Segmentation on Video
Shih-Po Lee, Si-Cun Chen, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2021.
This paper addresses fast semantic segmentation on video. Video segmentation often calls for real-time, or even faster than real-time, processing. One common recipe for conserving computation arising from feature extraction is to propagate features of a few selected keyframes. However, recent advances in fast image segmentation make these solutions less attractive. To leverage fast image segmentation for furthering video segmentation, we propose a simple yet efficient propagation framework. Specifically, we perform lightweight flow estimation in 1/8-downscaled image space for temporal warping in segmentation output space. Moreover, we introduce a guided spatially-varying convolution for fusing segmentations derived from the previous and current frames, to mitigate propagation error and enable lightweight feature extraction on non-keyframes. Experimental results on Cityscapes and CamVid show that our scheme achieves the state-of-the-art accuracy-throughput trade-off on video segmentation.
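A generic sketch of propagating a segmentation with flow estimated at 1/8 resolution: the low-resolution flow is upsampled and rescaled, and the previous segmentation is warped with it. The grid construction below is a standard warping recipe, not the exact GSVNet implementation.

```python
import torch
import torch.nn.functional as F

def warp_segmentation(prev_seg_logits, flow_lr):
    """Warp the previous frame's segmentation using flow estimated at 1/8 scale.

    prev_seg_logits: (B, C, H, W); flow_lr: (B, 2, H/8, W/8) in pixels at the
    low resolution, ordered (x, y).
    """
    _, _, h, w = prev_seg_logits.shape
    flow = F.interpolate(flow_lr, size=(h, w), mode="bilinear", align_corners=False) * 8.0
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xx, yy), dim=0).float().to(flow.device)   # base pixel coordinates, x first
    coords = grid.unsqueeze(0) + flow                             # sampling positions in the previous frame
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                 # normalize to [-1, 1] for grid_sample
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(prev_seg_logits, sample_grid, align_corners=True)
```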
Weakly-Supervised Image Semantic Segmentation Using Graph Convolutional Networks
Shun-Yi Pan*, Cheng-You Lu*, Shih-Po Lee, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2021.
This work addresses weakly-supervised image semantic segmentation based on image-level class labels. One common approach to this task is to propagate the activation scores of Class Activation Maps (CAMs) using a random-walk mechanism in order to arrive at complete pseudo labels for training a semantic segmentation network in a fully-supervised manner. However, the feed-forward nature of the random walk imposes no regularization on the quality of the resulting complete pseudo labels. To overcome this issue, we propose a Graph Convolutional Network (GCN)-based feature propagation framework. We formulate the generation of complete pseudo labels as a semi-supervised learning task and learn a 2-layer GCN separately for every training image by back-propagating a Laplacian and an entropy regularization loss. Experimental results on the PASCAL VOC 2012 dataset confirm the superiority of our scheme to several state-of-the-art baselines.
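A minimal sketch of learning a per-image 2-layer GCN with a Laplacian and an entropy regularization loss, assuming per-node features, an affinity graph, and confident CAM seeds; the weighting coefficients and interfaces are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def normalized_adjacency(affinity):
    """Symmetrically normalized adjacency: D^-1/2 (A + I) D^-1/2."""
    a = affinity + torch.eye(affinity.shape[0], device=affinity.device)
    d_inv_sqrt = a.sum(dim=1).clamp_min(1e-9).pow(-0.5)
    return d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]

def gcn_refine_loss(feats, affinity, seed_labels, seed_mask, w1, w2, alpha=1.0, beta=0.1):
    """Forward pass of a per-image 2-layer GCN and its training loss.

    feats: (N, F) node features, affinity: (N, N) graph weights, seed_labels:
    (N,) long tensor of confident CAM labels, seed_mask: (N,) boolean mask of
    the seed nodes, w1/w2: learnable GCN weight matrices.
    """
    a_hat = normalized_adjacency(affinity)
    h = F.relu(a_hat @ feats @ w1)
    logits = a_hat @ h @ w2
    probs = F.softmax(logits, dim=-1)

    ce = F.cross_entropy(logits[seed_mask], seed_labels[seed_mask])        # fit the confident seeds
    laplacian = (affinity * torch.cdist(probs, probs).pow(2)).sum() / 2    # smoothness over the graph
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()    # push toward confident labels
    return ce + alpha * laplacian + beta * entropy
```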
Semantic Segmentation on Compressed Video Using Block Motion Compensation and Guided Inpainting
Stefanie Tanujaya, Tieh Chu, Jia-Hao Liu, Wen-Hsiao Peng
IEEE International Symposium on Circuits and Systems (ISCAS), Spain, Oct. 2020.
This paper addresses the problem of fast semantic segmentation on compressed video. Unlike most prior works for video segmentation, which perform feature propagation based on optical flow estimates or sophisticated warping techniques, ours takes advantage of block motion vectors in the compressed bitstream to propagate the segmentation of a keyframe to subsequent non-keyframes. This approach, however, needs to respect the inter-frame prediction structure, which often suggests recursive, multi-step prediction with error propagation and accumulation in the temporal dimension. To tackle the issue, we refine the motion-compensated segmentation using inpainting. Our inpainting network incorporates guided non-local attention for long-range reference and pixel-adaptive convolution for ensuring the local coherence of the segmentation. A fusion step then follows to combine both the motion-compensated and inpainted segmentations. Experimental results show that our method outperforms the state-of-the-art baselines in terms of segmentation accuracy. Moreover, it introduces the least amount of network parameters and multiply-add operations for non-keyframe segmentation.

Incremental Learning
Class-incremental Learning with Rectified Feature-Graph Preservation
Cheng-Hsun Lei*, Yi-Hsin Chen*, Wen-Hsiao Peng, Wei-Chen Chiu
Asian Conference on Computer Vision (ACCV), Japan, Nov. 2020.
In this paper, we address the problem of distillation-based class-incremental learning with a single head. A central theme of this task is to learn new classes that arrive in sequential phases over time while keeping the model's capability of recognizing seen classes with only limited memory for preserving seen data samples. Many regularization strategies have been proposed to mitigate the phenomenon of catastrophic forgetting. To understand better the essence of these regularizations, we introduce a feature-graph preservation perspective. Insights into their merits and faults motivate our weighted-Euclidean regularization for old knowledge preservation. We further propose rectified cosine normalization and show how it can work with binary cross-entropy to increase class separation for effective learning of new classes. Experimental results on both CIFAR-100 and ImageNet datasets demonstrate that our method outperforms the state-of-the-art approaches in reducing classification error, easing catastrophic forgetting, and encouraging evenly balanced accuracy over different classes.

Visual Question Answering
Learning Goal-oriented Visual Dialogue: Imitating and Surpassing Analytic Experts
Yen-Wei Chang, Wen-Hsiao Peng
IEEE International Conference on Multimedia and Expo (ICME), July 2019.
This paper tackles the problem of learning a questioner in the goal-oriented visual dialog task. Several previous works adopt model-free reinforcement learning. Most pretrain the model from a finite set of human-generated data. We argue that using limited demonstrations to kick-start the questioner is insufficient due to the large policy search space. Inspired by a recently proposed information theoretic approach, we develop two analytic experts to serve as a source of high-quality demonstrations for imitation learning. We then take advantage of reinforcement learning to refine the model towards the goal-oriented objective. Experimental results on the GuessWhat?! dataset show that our method has the combined merits of imitation and reinforcement learning, achieving the state-of-the-art performance.

Deep Generative Model
Learning Priors for Adversarial Autoencoders
Hui-Po Wang, Wen-Hsiao Peng, Wei-Jan Ko
Asia-Pacific Signal and Information Processing Association (APSIPA), USA, Nov. 2018.
Most deep latent factor models choose simple priors for simplicity, tractability or not knowing what prior to use. Recent studies show that the choice of the prior may have a profound effect on the expressiveness of the model, especially when its generative network has limited capacity. In this paper, we propose to learn a proper prior from data for adversarial autoencoders (AAEs). We introduce the notion of code generators to transform manually selected simple priors into ones that can better characterize the data distribution. Experimental results show that the proposed model can generate better image quality and learn better disentangled representations than AAEs in both supervised and unsupervised settings. Lastly, we present its ability to do cross-domain translation in a text-to-image synthesis task.

AI Drone
Learning to Fly with a Video Generator
Chia-Chun Chung, Wen-Hsiao Peng, Teng-Hu Cheng and Chia-Hau Yu
IEEE International Conference on Visual Communications and Image Processing (VCIP), Dec. 2021.
This paper demonstrates a model-based reinforcement learning framework for training a self-flying drone. We implement the Dreamer proposed in a prior work as an environment model that responds to the action taken by the drone by predicting the next video frame as a new state signal. The Dreamer is a conditional video sequence generator. This model-based environment avoids the time-consuming interactions between the agent and the environment, largely speeding up the training process. This demonstration showcases for the first time the application of the Dreamer to train an agent that can finish the racing task in the AirSim simulator.