GSVNet: Guided Spatially-Varying Convolution for Fast Semantic Segmentation on Video
Shih-Po Lee, Si-Cun Chen, Wen-Hsiao Peng
Abstract
This paper addresses fast semantic segmentation on video. Video segmentation often calls for real-time, or even faster-than-real-time, processing. A common recipe for saving the computation spent on feature extraction is to propagate the features of a few selected keyframes. However, recent advances in fast image segmentation make such solutions less attractive. To leverage fast image segmentation for advancing video segmentation, we introduce a guided spatially-varying convolution that fuses segmentations derived from the previous and current frames, mitigating propagation error and enabling lightweight feature extraction on non-keyframes.
Methodology
Spatial Propagation with Ideal-Delay Kernels
Use ideal-delay kernels to gather the surrounding semantic predictions from the previous frame, which serve as candidates for the current frame's prediction; a minimal sketch follows.
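Convolving the previous segmentation with k-by-k ideal-delay (one-hot) kernels simply shifts it by one pixel per kernel, so the candidate set is the collection of all k-by-k spatially shifted copies. Below is a minimal PyTorch sketch of that candidate-gathering step; the function name, tensor layout, and default kernel size are illustrative assumptions, not the repository's actual API.

    import torch
    import torch.nn.functional as F

    def ideal_delay_candidates(prev_seg, kernel_size=3):
        """Gather the k*k spatially shifted copies of the previous segmentation.

        Each ideal-delay (one-hot) kernel shifts the input by one pixel, so
        collecting all k*k neighbours of every pixel with unfold yields the
        same set of candidate predictions.

        prev_seg: (B, C, H, W) soft segmentation of the previous frame
        returns:  (B, k*k, C, H, W) candidates, one per spatial shift
        """
        b, c, h, w = prev_seg.shape
        pad = kernel_size // 2
        # (B, C * k*k, H*W): every k*k neighbourhood around each pixel
        patches = F.unfold(prev_seg, kernel_size, padding=pad)
        patches = patches.view(b, c, kernel_size * kernel_size, h, w)
        return patches.permute(0, 2, 1, 3, 4).contiguous()

    # Example: 19 Cityscapes classes at quarter resolution.
    prev_seg = torch.randn(1, 19, 128, 256)
    candidates = ideal_delay_candidates(prev_seg)   # (1, 9, 19, 128, 256)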
Lightweight Intra-frame Segmentation
Deal with dis-occluded areas and errors in the previous frame's segmentation by extracting intra-frame features from the current frame with a lightweight network.
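The page does not spell out the architecture of this branch. As a rough illustration only, a small per-frame convolutional feature extractor could look like the following; the class name, layer sizes, and output resolution are assumptions and the paper's actual lightweight network may differ.

    import torch.nn as nn

    class LightweightIntraNet(nn.Module):
        """A small per-frame feature extractor (hypothetical layer sizes).

        Runs on every non-keyframe to produce intra-frame features that help
        fill dis-occluded regions and correct propagated errors.
        """
        def __init__(self, in_ch=3, feat_ch=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, feat_ch, 3, padding=1),
                nn.BatchNorm2d(feat_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
                nn.BatchNorm2d(feat_ch),
                nn.ReLU(inplace=True),
            )

        def forward(self, frame):           # frame: (B, 3, H, W)
            return self.net(frame)          # (B, feat_ch, H, W)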
Guided Dynamic Filtering
Fuse the candidates from the ideal-delay kernels with the intra-frame features using dynamic filters that are both input-varying and spatially-varying.
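One common way to realize such spatially-varying fusion is to predict a per-pixel weight for each shift candidate from the intra-frame (guidance) features and blend the candidates with a softmax-normalized weighted sum. The sketch below follows that pattern under the tensor shapes used above; the module name, head design, and parameter choices are illustrative assumptions rather than the paper's exact filter-generation network.

    import torch.nn as nn
    import torch.nn.functional as F

    class GuidedDynamicFilter(nn.Module):
        """Blend shift candidates with per-pixel, input-dependent weights (a sketch).

        A small head predicts one weight per candidate at every pixel from the
        guidance features; softmax-normalized weights then combine the
        candidates into the fused segmentation.
        """
        def __init__(self, feat_ch=32, num_candidates=9):
            super().__init__()
            self.weight_head = nn.Conv2d(feat_ch, num_candidates, 3, padding=1)

        def forward(self, candidates, guidance):
            # candidates: (B, K, C, H, W) from the ideal-delay kernels
            # guidance:   (B, feat_ch, H, W) intra-frame features
            weights = F.softmax(self.weight_head(guidance), dim=1)   # (B, K, H, W)
            fused = (candidates * weights.unsqueeze(2)).sum(dim=1)   # (B, C, H, W)
            return fused

Because the filter weights are computed from the current frame at every pixel, the fusion adapts both to the input content and to the spatial location, which is what "input-varying and spatially-varying" refers to.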
Experimental Results
Citation
S.-P. Lee, S.-C. Chen, and W.-H. Peng, "GSVNet: Guided Spatially-Varying Convolution for Fast Semantic Segmentation on Video," in IEEE International Conference on Multimedia and Expo (ICME), July 2021.
@misc{lee2021gsvnet,
      title={GSVNet: Guided Spatially-Varying Convolution for Fast Semantic Segmentation on Video}, 
      author={Shih-Po Lee and Si-Cun Chen and Wen-Hsiao Peng},
      year={2021},
      eprint={2103.08834},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Sponsor