Research Topics
(Four papers submitted to CVPR22)
- Trajectory Prediction for Road Scenes (Coming soon) - ECCV20
- Holistic Road Layout Understanding (Coming soon) - CVPR20
- Occlusion Aware Indoor Scene Understanding (Coming soon) - CVPR20
- Active Learning for Domain Adaptation (Coming soon) - WACV20
- Parametric Road Layout Understanding (Coming soon) - CVPR19
- Active Learning for HPE (Coming soon) - ICCV17
- Active Inference in Videos - CVPR15
- Anytime Scene Understanding - ECCV16
- Holistic Dynamic Scene Understanding - WACV14
- Semantic Video Parsing with Object Cues - WACV15
Active Inference in Videos - CVPR15
We address the problem of integrating object reasoning with supervoxel labeling in multiclass semantic video segmentation. To this end, we first propose an object-augmented dense CRF in the spatio-temporal domain, which captures long-range dependencies between supervoxels and imposes consistency between object and supervoxel labels. We develop an efficient mean field inference algorithm to jointly infer the supervoxel labels, object activations and their occlusion relations for a moderate number of object hypotheses. To scale up our method, we adopt an active inference strategy that improves efficiency by adaptively selecting object subgraphs in the object-augmented dense CRF. We formulate the selection problem as a Markov Decision Process and learn an approximately optimal policy based on a reward of accuracy improvement and a set of well-designed model and input features. We evaluate our method on three publicly available multiclass video semantic segmentation datasets and demonstrate superior efficiency and accuracy.
- Efficient inference for an object-augmented dense CRF model for the multi-class semantic video segmentation task.
- Active inference method that selects an informative subset of object proposals to scale up to a large number of proposals.
[ pdf ]
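The mean field inference at the core of this model can be illustrated with a toy sketch. The code below is a deliberately simplified version, not the paper's algorithm: it drops object activations and occlusion variables, assumes a plain Potts compatibility, and uses a dense pairwise kernel over a handful of supervoxels. The function name and inputs are hypothetical.

```python
import numpy as np

def mean_field_dense_crf(unary, kernel, n_iters=10):
    """Simplified mean-field inference for a dense CRF over supervoxels.

    unary:  (N, L) negative log-probabilities for N supervoxels, L labels.
    kernel: (N, N) pairwise affinities capturing long-range dependencies
            between supervoxels (a stand-in for the paper's kernels).
    Returns (N, L) approximate marginals Q.
    """
    Q = np.exp(-unary)
    Q /= Q.sum(axis=1, keepdims=True)            # initialise from unaries
    for _ in range(n_iters):
        msg = kernel @ Q                         # aggregate neighbour beliefs
        # Potts compatibility: a label is penalised by the mass its
        # neighbours place on every *other* label
        energy = unary + (msg.sum(axis=1, keepdims=True) - msg)
        Q = np.exp(-energy)
        Q /= Q.sum(axis=1, keepdims=True)        # normalise per supervoxel
    return Q
```

Each iteration is a closed-form update, which is what makes mean field attractive for dense (fully connected) graphs where exact inference is intractable.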
Anytime Scene Understanding - ECCV16
With increasing demand for efficient image and video analysis, the test-time cost of scene parsing becomes critical for many large-scale or time-sensitive vision applications. We propose a dynamic hierarchical model for anytime scene labeling that allows us to achieve flexible tradeoffs between efficiency and accuracy in pixel-level prediction. In particular, our approach incorporates the cost of feature computation and model inference, and optimises model performance for any given test-time budget by learning a sequence of image-adaptive hierarchical models. We formulate this anytime representation learning as a Markov Decision Process with a discrete-continuous state-action space. A high-quality policy of feature and model selection is learned with an approximate policy iteration method equipped with an action proposal mechanism. We demonstrate the advantages of our dynamic, non-myopic anytime scene parsing on three semantic segmentation datasets, achieving 90% of state-of-the-art performance at 15% of the overall cost.
- Propose a dynamic hierarchical model for anytime scene labeling that allows us to achieve flexible tradeoffs between efficiency and accuracy in pixel-level prediction.
- Learn a high-quality policy of feature and model selection based on an approximate policy iteration method.
[ pdf ]
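The budget-aware selection of features and models can be sketched with a toy policy. The paper learns a non-myopic policy via approximate policy iteration; the greedy gain-per-cost rule below is only an illustrative stand-in, and the action names, costs, and gains are made up.

```python
def anytime_select(actions, budget):
    """Greedy sketch of budget-aware action selection for anytime parsing.

    actions: list of (name, cost, expected_gain) tuples -- hypothetical
             feature/model computations with estimated accuracy gains.
    budget:  total test-time budget.
    Returns the ordered list of action names executed within budget.
    """
    remaining = budget
    pending = list(actions)
    plan = []
    while pending:
        # only consider actions that still fit in the remaining budget
        affordable = [a for a in pending if a[1] <= remaining]
        if not affordable:
            break
        # myopic rule: best expected accuracy gain per unit of cost
        best = max(affordable, key=lambda a: a[2] / a[1])
        plan.append(best[0])
        remaining -= best[1]
        pending.remove(best)
    return plan
```

Because the loop can stop after any action, every prefix of the plan is a valid answer, which is the defining property of an anytime algorithm.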
Holistic Dynamic Scene Understanding - WACV14
Video segmentation has been a popular research topic in recent years. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometrically and semantically consistent classes, based on a model that combines geometric and semantic information within a box framework. We also apply a multilayer representation as a higher-order smoothness term. Moreover, our box representation proves beneficial for understanding both static and dynamic scenes. Lastly, we show that our full model improves the results. Our method is tested on the CamVid dataset and achieves 96% pixel-level as well as per-class accuracy on geometric labels. On semantic labels, we obtain the best per-class accuracy, and our pixel-level accuracy is comparable to the closest spatio-temporal methods.
- Jointly infer multiple (semantic and geometric) labels consistently in video by introducing an intermediate scene representation.
- Introduce hierarchical structure to enforce long-range spatio-temporal labeling smoothness.
[ pdf ]
Semantic Video Parsing with Object Cues - WACV15
We tackle the problem of holistic dynamic scene understanding. In particular, we jointly reason about multiple high-level relations in video sequences in a consistent manner. We propose a model that jointly infers geometric and semantic relations as well as object occlusion relations in videos taken from a monocular camera. Moreover, we exploit temporal information to effectively generate detections with weaker detectors and to reason about occlusion relations between spatially overlapping objects. Finally, our model is able to infer the semantic and geometric labeling, detections and the relative ordering between objects in a one-shot optimisation. We demonstrate the effectiveness of our method on two datasets and show that our model achieves comparable or much better per-class accuracy than the state of the art with less supervision.
- Jointly infer supervoxels, object activations and their relationships (support and occlusion) in a unified energy function.
- Exploit temporal cues to learn from less training data: 10 to 20 exemplars for each class.
[ pdf ]
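A minimal sketch of occlusion-order reasoning between two detections: in the paper the relative ordering is inferred jointly with labels and detections, whereas the toy rule below simply asks whose evidence dominates the overlap region. The function, box format, and score maps are all assumptions for illustration.

```python
import numpy as np

def occlusion_order(box_a, box_b, score_a, score_b):
    """Decide which of two overlapping detections occludes the other.

    box_*:   (x1, y1, x2, y2) detection boxes in pixel coordinates.
    score_*: per-pixel confidence maps (H, W) for each object, e.g.
             back-projected from a weak detector (hypothetical input).
    In the overlap region, the object whose average evidence is higher
    is taken as the occluder. Returns 'a', 'b', or None if the boxes
    do not overlap.
    """
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    if x1 >= x2 or y1 >= y2:
        return None                       # no overlap, no ordering needed
    region = (slice(y1, y2), slice(x1, x2))
    return 'a' if score_a[region].mean() >= score_b[region].mean() else 'b'
```

In the full model such pairwise ordering preferences would enter the unified energy function rather than being decided greedily per pair.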