As is well known in the community, video event detection and human action recognition pose tremendous research challenges. These challenges are mainly attributed to complex video backgrounds (which make foreground segmentation intractable) and to large variations in human pose, body scale, clothing, occlusion/self-occlusion and viewpoint. For understanding high-level human behaviour, including the interactions between humans, there is also a large gap between low-level visual features and the behaviour to be recognized. A depth camera provides 3-D scene structure data that is complementary to the RGB data obtained from a conventional camera, and it helps to alleviate the foreground segmentation and motion capture problems. However, the image quality produced by a depth camera is much lower than that of an RGB camera, and how to effectively fuse these two modalities to represent events and actions more discriminatively remains challenging.