Motion Estimation

Overview

See: http://en.wikipedia.org/wiki/Motion_estimation

Parallelizing Motion Estimation

Problem

The biggest problem of doing motion estimation on CUDA is getting the MVp for a given macroblock. The H.264 standard defines the MVP based on the macroblocks surrounding a given macroblock (as defined by the raster scan order). The “basic” predictor is the median of the motion vectors of the macroblock partitions or subpartitions immediately above, diagonally above and to the right, and immediately left of the current partition or sub-partition. The predictor is modified if (a) 16x8 or 8x16 partitions are chosen and/or (b) if some of the neighboring partitions are not available as predictors.

When performing the motion estimation calculations the cost of a given MV is determined as a linear combination of the SAD and the MV to get this SAD MINUS the MVp:

cost = SAD + lambda * [cost(MVx-MVPx)+cost(MVy-MVPy)]

If this MVP is ignored in the cost calculations, the chosen MV will not be optimal. Any solution in CUDA must address this problem. Two proposed solutions are using a hierarchical motion estimation to get a MVP for the block and determining the dependency ordering (see the slides Jae sent out) and using the waveform solution.

CUDA Solutions

Hierarchical (i.e. Pyramid) Motion Estimation

Wavefront

    • Standard approach in parallel programming

    • See slides that Jae sent out. (these are omitted as request of author)

      • Alternatively see: Data Partition for Wavefront Parallelization of H.264 Video Encoder (these seem to propose a similar solution)

    • The idea is to process in parallel MBs that do not share dependencies. In other words we do need to process certain macroblocks in serial.

Ignore Dependency Concerns

    • Approach taken in academic literature (see below)

    • Sacrifices quality for speed (doesn't deal with MVp issue)

x264 Encoder

Luminance and Chroma

When doing motion estimation, SAD is always performed on Luminance. Cr is only considered when 'chroma ME' is enabled, which (I think) is enabled at 'subme 4' (check GIT log, not that long ago it got merged into there).

fullpel and hpel search use luma information only, while qpel search will use luma and chroma information when chroma ME is enabled

Slices

There are only three reasons to use slices

    1. better error resilience (we don't care about this)

    2. threading (our threading is better than slice-based threading)

    3. decoder threading (good decoders don't need slices to thread)

for example: http://akuvian.org/src/x264/sliceless_threads.txt

SATD (Sum of Average Transformed Differences)

SATD uses the Hadamard transform (check p451ff of http://www.jjj.de/fxt/fxtbook.pdf if you want to know more). The hadamard transform is performed on the difference between the original block and the predicted block, then the absolute values of the hadamard transforms coefficients are summed to give the SATD score

It's one of the metrics for similarity of blocks.

it's used mostly in qpel search, deciding intra prediction directions and for MB mode decision (this is when RDO is disabled)

SATD isn' t used during the integer part of the search, except when '--me tesa' is used but not otherwise. So if --me is hex (as in by default), no SATD. if me is dia,hex,umh or esa then satd won't be used in the fullpel search. SATD is also used when you have 'i_subpel_refine > 1' mbcmp_init in encoder.c sets fpelcmp to sad or satd and mbcmp is always satd except when 'subme <= 1'

How much gain is there from tesa vs esa for motion estimation? very little, almost placebo effect

<Setsuna-Xero> if you want to optimize one thing that will speed everything up SATD and SAD are the two that will really show it but that said they're optimized beyond all reason

<holger> setsuna-xero: already being in the works. you'll be surprised soon ;)

<Setsuna-Xero> you mean theres more? 0_0

<holger> no, satd isn't. there's a lot more ;)

<Dark_Shikari> hadamard transform is trivial

<Dark_Shikari> its the frequency transform that can be represented by a matrix multiply with a matrix of nothing but 1s and -1s

<Dark_Shikari> i.e. the frequency transform calculatable with only adds and subtracts

<Dark_Shikari> the simplest possible transform

<holger> imagine the array going on top [0..x]. in each row, if there's a star below the array element, add it, otherwise sub it. row sums are transformed values. abs and sum them to get satd.

Forum Discussion

Sample Code

References

    • H.264 ME algorithm implemented in CUDA

      • Wei-Nien Chen; Hsueh-Ming Hang, "H.264/AVC motion estimation implmentation[sic] on Compute Unified Device Architecture (CUDA)," Multimedia and Expo, 2008 IEEE International Conference on, pp.697-700, June 23 2008-April 26 2008.

      • Chuan-Yiu Lee; Yu-Cheng Lin; Chi-Ling Wu; Chin-Hsiang Chang; You-Ming Tsao; Shao-Yi Chien, "Multi-Pass and Frame Parallel Algorithms of Motion Estimation in H.264/AVC for Generic GPU," Multimedia and Expo, 2007 IEEE International Conference on , vol., no., pp.1603-1606, 2-5 July 2007.

      • S Ryoo, CI Rodrigues, SS Baghsorkhi, SS Stone, DB. "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA" 2008.

    • H.264 Motion Estimation on GPU

      • "Multi-Pass and Frame Parallel Algorithms of Motion Estimation in H264 AVC for Generic GPU." Chuan-Yiu Lee, Yu-Cheng Lin, Chi-Ling Wu, Chin-Hsiang Chang,You-Ming Tsao, and Shao-Yi Chien.

    • H.264 Motion Estimation algorithm implemented on Intel architecture

      • Yen-Kuang Chen; Tian, X.; Steven Ge; Girkar, M., "Towards efficient multi-level threading of H.264 encoder on Intel hyper-threading architectures," Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International , vol., no., pp. 63-, 26-30 April 2004.

    • Motion Estimation

      • "A Block-Based Gradient Descent Search Algorithm for Block Motion Estimation in Video Coding." Lie, L; Feing E.