Final Presentation
Outline
Background
H.264 video codec & encoding
Interframe encoding
MVs + reference = prediction; current frame - prediction = residual
Motion Estimation
It's parallelizable!
1920×1088 (1080p HD video)
= 8,160 Macroblocks
Where did the block go?
Search range: ±16 pixels is typical (sometimes ±32)
(16*2+1)^2 = 1089 candidate positions per macroblock
Per position: SAD (16x16)
Full search (exhaustive)
a lot!
Usually (e.g. in x264) not done exhaustively, but still a lot of work per frame
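The numbers above can be checked with a small pure-Python sketch of exhaustive SAD search. This is an illustration only, not the project's code: frames are assumed to be lists of rows of luma samples, and the names `sad`, `full_search` and `workload` are made up for this example.

```python
# Sketch only: frames are lists of rows of integer luma samples.

def sad(cur, ref, cx, cy, rx, ry, n=16):
    """Sum of absolute differences between the n x n block of `cur` at
    (cx, cy) and the n x n block of `ref` at (rx, ry)."""
    total = 0
    for j in range(n):
        cur_row = cur[cy + j]
        ref_row = ref[ry + j]
        for i in range(n):
            total += abs(cur_row[cx + i] - ref_row[rx + i])
    return total


def full_search(cur, ref, cx, cy, search_range=16, n=16):
    """Test every displacement in [-search_range, +search_range]^2 that keeps
    the candidate block inside `ref`. Returns ((dx, dy), best_sad)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate falls outside the reference frame
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def workload(width=1920, height=1088, search_range=16, n=16):
    """Macroblocks per frame, positions per macroblock, and total
    absolute differences for an exhaustive search of one frame."""
    macroblocks = (width // n) * (height // n)
    positions = (2 * search_range + 1) ** 2
    return macroblocks, positions, macroblocks * positions * n * n
```

With the default arguments, `workload()` gives 8,160 macroblocks, 1,089 positions each, and roughly 2.3 billion absolute differences per frame. Every (macroblock, position) pair is independent of the others, which is what makes the SAD stage easy to spread across GPU threads.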
Previous attempts
On CPU
x264: highly optimized
In CUDA (Jae)
Wei-Nien Chen; Hsueh-Ming Hang, "H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA)," Multimedia and Expo, 2008 IEEE International Conference on, pp.697-700, June 23-26, 2008.
S Ryoo, CI Rodrigues, SS Baghsorkhi, SS Stone, DB. "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA" 2008.
MVp problem
needed for cost calculations
Quality vs. speed
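A simplified sketch of why the MVp creates a dependency between blocks: in H.264 the predictor is generally the component-wise median of the neighbouring blocks' motion vectors, and the search cost charges for the bits needed to code (MV - MVp). The code below ignores the standard's special cases (unavailable neighbours, partition shapes, sub-pixel units), and the function names and the value of `lam` are assumptions made for illustration.

```python
def median_mvp(left, top, top_right):
    """Component-wise median of three neighbouring motion vectors."""
    xs = sorted(v[0] for v in (left, top, top_right))
    ys = sorted(v[1] for v in (left, top, top_right))
    return xs[1], ys[1]


def se_bits(v):
    """Length in bits of the signed Exp-Golomb code for integer v."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def mv_cost(sad_value, mv, mvp, lam):
    """Rate-constrained cost J = SAD + lambda * bits(mv - mvp)."""
    bits = se_bits(mv[0] - mvp[0]) + se_bits(mv[1] - mvp[1])
    return sad_value + lam * bits
```

For example, `median_mvp((1, 5), (3, -2), (2, 0))` is `(2, 0)`, and `mv_cost(100, (3, 0), (2, 0), 4)` is `116`. Because `mvp` depends on the neighbours' final vectors, a block's cost cannot be evaluated until those neighbours are done. Dropping the MVp term removes the dependency but changes the cost, which is the quality vs. speed trade-off.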
Our project
Deal with MVp problem to allow us to solve ME problem in parallel
Hierarchical (pyramid)
provides an estimate for the MVp
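A minimal sketch of a coarse-to-fine (pyramid) search, assuming 2x2 averaging between levels and a small refinement radius at each finer level. The level count, radii and function names are illustrative assumptions, not the project's actual parameters. The vector found at a coarse level is available before any full-resolution neighbour has been searched, so it could stand in as an MVp estimate.

```python
def downsample(frame):
    """Halve resolution by averaging each 2x2 group of samples (rounded)."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
              + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1] + 2) // 4
             for x in range(w)] for y in range(h)]


def search(cur, ref, cx, cy, n, centre, radius):
    """Best ((dx, dy), sad) among in-bounds displacements within `radius`
    of `centre`, or None if no candidate lies inside `ref`."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(centre[1] - radius, centre[1] + radius + 1):
        for dx in range(centre[0] - radius, centre[0] + radius + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > w or ry + n > h:
                continue
            cost = sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
                       for j in range(n) for i in range(n))
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best


def hierarchical_search(cur, ref, cx, cy, n=16, levels=3,
                        coarse_radius=4, refine_radius=1):
    """Coarse-to-fine motion vector for the n x n block at (cx, cy)."""
    pyramid = [(cur, ref)]
    for _ in range(levels - 1):
        c, r = pyramid[-1]
        pyramid.append((downsample(c), downsample(r)))
    mv, cost = (0, 0), None
    for level in range(levels - 1, -1, -1):
        c, r = pyramid[level]
        scale = 1 << level
        radius = coarse_radius if level == levels - 1 else refine_radius
        found = search(c, r, cx // scale, cy // scale,
                       max(n // scale, 1), mv, radius)
        if found is not None:
            mv, cost = found
        if level > 0:
            mv = (mv[0] * 2, mv[1] * 2)  # express the vector at the next finer level
    return mv, cost
```

With three levels and a coarse radius of 4, this covers roughly ±16 pixels at full resolution while testing far fewer positions than the 1089 of a full search. The price is that a coarse-level mistake can steer the refinement away from the true optimum.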
Alternatives: wavefront (figure out interblock dependencies)
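A sketch of wavefront scheduling, assuming each macroblock depends on its left, top and top-right neighbours (the blocks that usually feed the median MVp). That dependency pattern is an assumption for illustration, and `wavefront_schedule` is a hypothetical name.

```python
def wavefront_schedule(mb_width, mb_height):
    """Group macroblock coordinates (x, y) into waves. Block (x, y) goes in
    wave x + 2*y, so its left, top and top-right neighbours all sit in
    earlier waves and every block within one wave can run in parallel."""
    waves = {}
    for y in range(mb_height):
        for x in range(mb_width):
            waves.setdefault(x + 2 * y, []).append((x, y))
    return [waves[k] for k in sorted(waves)]
```

For a 1080p frame (120 x 68 macroblocks) this gives 254 sequential waves with at most 60 macroblocks in flight at once. That keeps the exact MVp but exposes much less parallelism than treating all 8,160 macroblocks independently.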
CUDA implementation overview
thread organization
memory organization
Testing framework (Lawrence)
Python framework for ME algorithm testing (replaces the earlier MATLAB portion)
examples of C extensions for gold standard code
and examples of PyCUDA code
side by side or overlay comparisons
Results (?)
motion estimation speedup
entire encoder speedup
video encoding/decoding demo
Conclusions
future extension
CUDA experiences (prescription for future improvement of language, architecture, tools, programming model, etc.)
Acknowledgement
Dark_Shikari (x264 dev)