
Real-time Video Analytics

Real-time Object Detection and Tracking in H.264/AVC Videos


Object detection and tracking techniques are applicable to video content analysis in applications such as intelligent surveillance systems and interactive broadcasting services. Most methods operate on raw pixel data for accuracy (the so-called "pixel-domain approach"), but they require customized hardware due to their high computational complexity. On the other hand, the compressed-domain approach, which uses encoded information such as motion vectors, is extremely fast, but its performance on a variety of natural scenes is worse than that of the pixel-domain approach. To overcome these limitations, we proposed the dissimilarity minimization (DM) and probabilistic spatiotemporal macroblock filtering (PSMF) algorithms as hybrid compressed-domain/pixel-domain methods for H.264/AVC compressed videos. Since they use partially decoded data as well as encoded information, these methods show reliable performance in natural scenes while remaining fast enough to run in real time. They can also support color extraction of objects and object recognition, and can deal with long-term occlusion.

Dissimilarity Minimization

For semi-automatic object detection and tracking, we adopted a dissimilarity energy minimization algorithm that uses motion vectors and partially decoded luminance signals to track adaptively according to the properties of the target object in H.264/AVC videos. It is a feature-based approach that tracks feature points selected by a user. First, it roughly predicts the position of each feature point using motion vectors extracted from the H.264/AVC bitstream. Then it finds the best position inside a given search region by considering three clues: texture, form, and motion dissimilarity energies. Since only the neighborhoods of the feature points are partially decoded to compute these energies, the computational complexity is scarcely increased. The set of best feature-point positions in each frame is selected to minimize the total dissimilarity energy by dynamic programming, and the weight factors of the dissimilarity energies are adaptively updated by a neural network.
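The dynamic-programming step above can be sketched as a minimum-cost path search over per-frame candidate positions. This is a simplified illustration, not the published implementation: the energy functions are hypothetical placeholders standing in for the weighted texture, form, and motion dissimilarity terms.

```python
# Sketch of the dynamic-programming selection in dissimilarity minimization.
# candidates: one list of candidate positions per frame for a feature point.
# local_energy(f, p): per-position energy (e.g. texture + form dissimilarity).
# transition_cost(p_prev, p): inter-frame energy (e.g. motion dissimilarity).
# All energies here are illustrative placeholders, not the paper's formulas.

def min_energy_path(candidates, local_energy, transition_cost):
    """Return (total_energy, path) minimizing the summed energies."""
    # cost[i] = best total energy of any path ending at candidates[f][i]
    cost = [local_energy(0, p) for p in candidates[0]]
    back = []  # back-pointers for path reconstruction
    for f in range(1, len(candidates)):
        new_cost, ptr = [], []
        for p in candidates[f]:
            prev = min(range(len(cost)),
                       key=lambda i: cost[i] + transition_cost(candidates[f - 1][i], p))
            new_cost.append(cost[prev]
                            + transition_cost(candidates[f - 1][prev], p)
                            + local_energy(f, p))
            ptr.append(prev)
        cost = new_cost
        back.append(ptr)
    # trace the minimizing sequence of positions backwards
    i = min(range(len(cost)), key=lambda j: cost[j])
    total = cost[i]
    path = []
    for f in range(len(candidates) - 1, 0, -1):
        path.append(candidates[f][i])
        i = back[f - 1][i]
    path.append(candidates[0][i])
    path.reverse()
    return total, path
```

In the actual method, the candidate positions come from motion-vector prediction plus a small search region, and the weights on the energy terms are updated by the neural network; here they are folded into the two cost callbacks for brevity.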

To demonstrate the performance of the proposed method, tracking results for various objects were extracted from CIF-size videos. Fig. 1(a) shows the tracking result for a rigid object with slow motion: four feature points were tracked well, and the feature network kept a uniform form. Fig. 1(b) shows the tracking result for a deformable object with fast motion; tracking succeeds even though the form of the feature network changes greatly due to fast three-dimensional motion. Fig. 1(c) shows the visual result of partial decoding in P-frames, where only the neighborhoods of three feature points were decoded. When the JM reference software was used to read the H.264/AVC bitstream, the computation time was under 430 ms/frame.

Figure 1. Object Tracking Results: (a) Coastguard, (b) Stefan, (c) Lovers

Probabilistic Spatiotemporal Macroblock Filtering

To detect and track all moving objects automatically in H.264/AVC videos with a stationary background, we applied probabilistic spatiotemporal macroblock filtering (PSMF) together with a partial decoding process. The algorithm consists of two phases:

Extraction Phase: We roughly extract the block-level regions of objects and construct approximate object trajectories in each P-frame by the PSMF. We first eliminate probable background macroblocks, and then cluster the remaining macroblocks into several fragments. To distinguish object fragments from background fragments, we apply two filtering steps. In the first step, we filter out background fragments on the basis of block mode, transform coefficients, and the spatial structure of fragments. In the second step, we observe the temporal consistency of each surviving fragment over a given period and approximate the probability that it belongs to an object. The fragments with high occurrence probability are finally taken as parts of objects.
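The clustering and temporal-consistency steps of the extraction phase can be sketched as follows. This is an illustrative simplification: the first-step filter (block mode, transform coefficients, spatial structure) is assumed to have already produced the surviving foreground macroblocks, and the occurrence probability is reduced to a simple overlap frequency.

```python
# Hypothetical sketch of two PSMF extraction steps on a macroblock grid:
# (1) cluster surviving foreground macroblocks into fragments by
#     4-connectivity; (2) score each fragment by how often it overlaps the
#     foreground of preceding P-frames (a stand-in for the paper's
#     occurrence probability).

def cluster_fragments(foreground):
    """foreground: set of (row, col) macroblocks that passed the first
    filtering step. Returns a list of connected fragments (sets)."""
    seen, fragments = set(), []
    for mb in foreground:
        if mb in seen:
            continue
        stack, frag = [mb], set()
        while stack:  # depth-first flood fill over 4-neighbors
            r, c = stack.pop()
            if (r, c) in seen or (r, c) not in foreground:
                continue
            seen.add((r, c))
            frag.add((r, c))
            stack += [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
        fragments.append(frag)
    return fragments

def occurrence_probability(fragment, history):
    """history: foreground sets from the last N P-frames. Returns the
    fraction of past frames in which the fragment reappears (overlaps
    at least one foreground macroblock)."""
    if not history:
        return 0.0
    hits = sum(1 for past in history if fragment & past)
    return hits / len(history)
```

A fragment whose occurrence probability exceeds a threshold would then be kept as part of an object; the threshold and the exact probability model are details of the published method not reproduced here.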

Refinement Phase: We then accurately refine the object trajectories roughly generated in the extraction phase. First, we partially decode only the object blobs in each I-frame, perform background subtraction in each I-frame, and perform motion interpolation in each P-frame. Then the color information of each object is extracted and recorded in a database so that each object can be identified despite long occlusion or temporary disappearance.
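The color-based re-identification idea in the refinement phase can be sketched with a coarse color histogram per object blob. The bin count and the similarity measure (histogram intersection) are assumptions for illustration; the paper's actual color model may differ.

```python
# Illustrative sketch of object re-identification after long occlusion:
# build a coarse joint RGB histogram from a blob's partially decoded
# pixels, store it per object, and match a reappearing blob to the
# recorded object whose histogram it intersects most.

def color_histogram(pixels, bins=8):
    """pixels: iterable of (r, g, b) values in 0..255. Returns a
    normalized joint histogram {(r_bin, g_bin, b_bin): frequency}."""
    hist, n = {}, 0
    for r, g, b in pixels:
        key = (r * bins // 256, g * bins // 256, b * bins // 256)
        hist[key] = hist.get(key, 0) + 1
        n += 1
    return {k: v / n for k, v in hist.items()}

def reidentify(query_hist, database):
    """database: {object_id: histogram}. Returns the id whose histogram
    has the largest intersection with the query (higher = more similar)."""
    def intersection(h1, h2):
        return sum(min(h1.get(k, 0.0), h2.get(k, 0.0)) for k in h1)
    return max(database, key=lambda oid: intersection(query_hist, database[oid]))
```

Because the histogram is recorded in a database when the object is first detected, a blob that reappears after occlusion or temporary disappearance can be linked back to its original trajectory.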

The proposed method exhibited satisfactory performance over 720 frames of the indoor sequence and 990 frames of the outdoor sequence. As shown in Fig. 2(a), performance in the indoor sequence remained good even though the object continually changed in size as a person moved toward the camera. Moreover, although the body parts of the person (head, arms, and legs) had different motions, the bounding box of the object blob always enclosed the whole body precisely. Likewise, in the outdoor sequence, which contains multiple objects as shown in Fig. 2(b), the proposed algorithm detected and tracked all three persons separately.

Figure 2. Object Tracking Results: (a) Indoor, (b) Outdoor


Publications

  1. Wonsang You, M.S. Houari Sabirin, and Munchurl Kim, "Real-time detection and tracking of multiple objects with partial decoding in H.264/AVC bitstream domain," Proceedings of SPIE, vol. 7244 (N. Kehtarnavaz and M.F. Carlsohn, eds.), San Jose, CA, USA: SPIE, 2009, pp. 72440D-1-72440D-12.
  2. Wonsang You, "A Study on Moving Object Detection and Tracking with Partial Decoding in H.264|AVC Bitstream Domain," Master's thesis, ICU School of Engineering, KAIST, 2008.
  3. Wonsang You, M.S. Houari Sabirin, and Munchurl Kim, "Moving object tracking in H.264/AVC bitstream," Lecture Notes in Computer Science, vol. 4577, 2007, pp. 483-492.
  4. Wonsang You, "Analysis of H.264/AVC Encoder Reference Software," KAIST MCCB Lab Technical Reports, 17 July 2006.
  5. Wonsang You, "Spatial-Domain Transcoder with Reused Motion Vectors in H.264/AVC," KAIST ICC Class Reports, 2006.

Patents


  1. Munchurl Kim and Wonsang You, "Apparatus and method of tracking object in bitstream," KR 20080096342  (A), 10 October 2008.

Demo Videos