Communications with x264 Devs

[16:55] <cancan101> Hey

[16:55] <cancan101> I was wondering what the status of using CUDA to run x264 was?

[16:56] <Dark_Shikari> I haven't heard anything about it

[16:56] <cancan101> Were you working on it at some point?

[16:57] <Dark_Shikari> not really

[16:57] <cancan101> The motion estimation functions?

[16:57] <Dark_Shikari> there was a sort of feasibility study done a while back and, well, it didn't look very feasible unless someone very good wanted to donate many months of their time for no promise of any potential benefit.

[16:58] <holger_> that being the reason you didn't take the cuda gsoc application?

[16:58] <Dark_Shikari> every few weeks someone comes in saying they'll do it, and then doesn't actually do anything

[16:58] <Dark_Shikari> so I've been led to believe that nobody actually intends to do it

[16:58] <Dark_Shikari> and rather just thinks it would be a cool thing to talk about.

[16:58] <cancan101> Well it definitely seems cool to talk about

[16:58] <Yuvi> well it is a cool thing to talk about

[16:59] <rav0> what about me?

[16:59] <Yuvi> just not very practical

[16:59] <cancan101> but it also seems like there has been some commercial work on it

[16:59] <holger_> i think cuda is overrated atm. it has its uses, but they seem relatively limited right now.

[16:59] <Dark_Shikari> yes, lots of commercial work it seems

[16:59] <Dark_Shikari> but call me when that commercial work actually produces a usable product.

[17:00] <cancan101> So does badaboom not work (well)

[17:00] <cancan101> ?

[17:00] <Dark_Shikari> no

[17:00] <Dark_Shikari> it's slower than x264, thus completely useless

[17:00] <holger_> that is going to change when we get graphics cores on the cpu.

[17:00] <Dark_Shikari> you can beat x264 in three ways: quality, speed, and features

[17:00] <Dark_Shikari> quality, hell no, no chance

[17:01] <Dark_Shikari> features: no chance, while there are audiences x264 doesn't aim at that require even fancier things than x264 does, these encoders are far simpler

[17:01] <holger_> (cuda or something else sharing the same caches - now that could be something)

[17:01] <Dark_Shikari> speed: that's the only chance

[17:01] <Dark_Shikari> and it fails at that, too

[17:01] <Dark_Shikari> so, summary, it fails.

[17:01] <cancan101> Are there any commercial attempts that beat x264 at speed?

[17:01] <cancan101> other than badaboom

[17:02] <Dark_Shikari> well, speed is a relative term, speed relative to what

[17:02] <holger_> ati has something too. but they chose to publish numbers for coding hd video to 320x240 or something. who would want to do that?

[17:02] <Dark_Shikari> if you just mean raw encoding throughput, you could chain enough fpgas together to outperform x264, sure

[17:02] <Yuvi> youtube?

[17:02] <Dark_Shikari> but nothing actually practical.

[17:02] <Dark_Shikari> at least not practical if you're trying to max encoding throughput for a given cost

[17:03] <cancan101> Fair enough

[17:03] <Dark_Shikari> I don't know anything as fast as x264, or anywhere close

[17:03] <Dark_Shikari> now, one could make the claim that some other encoders might be better than x264 at some point on the speed-quality curve (if you take "Quality" as PSNR)

[17:03] <Dark_Shikari> and I think there are some points on the curve at which Mainconcept might win if you're solely measuring PSNR.

[17:03] <cancan101> So in your opinion and based on the feasibility study done in the past, it is entirely not trivial to modify the x264 code to run in CUDA?

[17:04] <Dark_Shikari> of course, it requires completely rewriting all code that you want to port

[17:04] <cancan101> also was the study done on CUDA ver 2.0

[17:04] <Dark_Shikari> from scratch

[17:04] <Dark_Shikari> and changing all the algorithms completely

[17:04] <Dark_Shikari> because what is fast on CPU is not fast on GPU

[17:04] <cancan101> Right

[17:04] <holger_> also, what most people tend to overlook with cuda: cuda isn't exactly in a tight power envelope.

[17:04] <Dark_Shikari> now, we already know what algorithm we'd want to implement on CUDA, because it's the standard hardware algorithm designed to minimize bandwidth usage and avoid having to estimate in raster order

[17:05] <rav0> standards are designed to be broken

[17:07] <cancan101> And the study/previous work, was it done using CUDA ver 2.0?

[17:07] <Dark_Shikari> probably not at the time

[17:07] <Dark_Shikari> I don't think it existed

[17:07] <Dark_Shikari> but I doubt it changes anything

[17:07] <Dark_Shikari> you still need a CUDA expert willing to code for a few months

[17:08] <cancan101> Can you elaborate on: the standard hardware algorithm designed to minimize bandwidth usage and avoid having to estimate in raster order

[17:08] <Dark_Shikari> pyramidal search

[17:09] <Dark_Shikari> it's what badaboom uses, and odds are most fpga encoders do it as well

[17:10] <Dopefish> i cut my cake into pieces using a pyramidal search

[17:10] <Dark_Shikari> mm, cake.

[17:11] <cancan101> And what is the background of that project as it relates to CUDA?

[17:12] <Dark_Shikari> which project?

[17:13] <cancan101> Implementing the algorithm that minimizes bandwidth usage and avoids having to estimate in raster order in CUDA

[17:13] <Dark_Shikari> whoever said anything about implementing a pyramidal search?

[17:13] <Dark_Shikari> as I said, it would require that rather hard to find cuda programmer ;)

[17:13] <Dopefish> and paying him

[17:14] <Dark_Shikari> well, moreso finding a volunteer

[17:14] <Dark_Shikari> Avail doesn't have it in the budget to pay one =p

[17:14] <Dark_Shikari> nehalems are Fast Enough(TM)

[17:14] <cancan101> Do you work for Avail?

[17:14] <Dark_Shikari> Sometimes.

[17:15] <cancan101> The reason I'm asking these questions is that my school has a CUDA class that involves a project

[17:15] * holger_ wonders if that would make for a phd thesis provided the diploma works out

[17:15] <cancan101> Though this project might be a bit too much for the level of the class

[17:15] <Dark_Shikari> oh dear, another person in a CUDA class who wants to do x264 in CUDA as a project

[17:15] <cancan101> Ah

[17:15] <cancan101> So i'm not the first

[17:15] <Dark_Shikari> If only we got them all together and had them actually do the work...

[17:16] <Dark_Shikari> Well, we're completely open to implementing such a thing

[17:16] <Dark_Shikari> even if it's very very simplified

[17:16] <Dark_Shikari> even if it doesn't support threads, multiref, B-frames, interlacing

[17:16] <Dark_Shikari> as long as it just works and is a starting point

[17:16] <Dark_Shikari> but this requires actually doing it, which so many people seem to not be very fond of.

[17:18] <holger_> and probably a decent amount of understanding the underlying hardware. i haven't looked at cuda recently, but i fear the api may be lacking.

[17:18] <Yuvi> that's one of the things that annoyed me when I did cuda/glsl stuff

[17:18] <Dark_Shikari> the algorithm itself is surprisingly simple and can be described in a few lines

[17:18] <Yuvi> documentation on what _is_ fast is rather lacking, and kinda handwaved

[17:18] <Dark_Shikari> if you don't do subpartitions and don't do subpel, it only requires three operations

[17:18] <Dark_Shikari> add

[17:19] <Dark_Shikari> absolute difference

[17:19] <holger_> sounds like nehalem atm. lots of educated guessing there, sometimes wrong.

[17:19] <Dark_Shikari> and 2x downscale interpolation

[17:19] <Dark_Shikari> the last of which is utterly trivial on a GPU
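
The three operations listed above (add, absolute difference, 2x downscale) are enough for a toy full-pel pyramidal search. The following is a minimal sketch in plain Python, not x264 or Badaboom code: the function names `downscale_2x`, `sad`, and `pyramidal_search`, and the parameters `size`, `levels`, and `radius`, are hypothetical choices made for illustration. It assumes frames are lists of rows of integers, ignores subpartitions and subpel, and searches only a small window around the vector inherited from the coarser level.

```python
def downscale_2x(img):
    """Halve width and height by averaging each 2x2 block (with rounding)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1] + 2) // 4
             for x in range(w)] for y in range(h)]


def sad(cur, ref, bx, by, mx, my, size):
    """Sum of absolute differences between the block at (bx, by) in cur and
    the block at (bx + mx, by + my) in ref. Returns None if the reference
    block falls outside the frame."""
    rx, ry = bx + mx, by + my
    if rx < 0 or ry < 0 or rx + size > len(ref[0]) or ry + size > len(ref):
        return None
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[ry + y][rx + x])
    return total


def pyramidal_search(cur, ref, bx, by, size=8, levels=3, radius=1):
    """Estimate a full-pel motion vector (mx, my) for one block, coarse to fine."""
    # Build the image pyramids: level 0 is full resolution.
    cur_pyr, ref_pyr = [cur], [ref]
    for _ in range(levels - 1):
        cur_pyr.append(downscale_2x(cur_pyr[-1]))
        ref_pyr.append(downscale_2x(ref_pyr[-1]))

    mx = my = 0
    for level in range(levels - 1, -1, -1):
        scale = 1 << level
        lbx, lby = bx // scale, by // scale
        lsize = max(size // scale, 1)
        best = None
        # Refine the inherited vector within a small window at this level.
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                cost = sad(cur_pyr[level], ref_pyr[level],
                           lbx, lby, mx + dx, my + dy, lsize)
                if cost is not None and (best is None or cost < best[0]):
                    best = (cost, mx + dx, my + dy)
        if best is not None:
            mx, my = best[1], best[2]
        if level > 0:
            # One pixel at this level is two pixels at the next finer level.
            mx, my = mx * 2, my * 2
    return mx, my
```

Each block's search here depends only on the image pyramids, not on neighbouring blocks' results, which is one reason this style of search does not need to proceed in raster order. A real encoder would add cost terms for the motion vector itself, which this sketch omits.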

[17:19] <Yuvi> with nehalem at least you have prior x86 cpus to go on as a base

[17:19] <Yuvi> from what I've read gpus vary wildly generation to generation

[17:20] <holger_> (my satd code has a case where using a "float" insn (which is actually int on conroe and penryn) is actually still faster on nehalem than doing it the way intel says.)

[17:21] <cancan101> As far as CUDA API goes

[17:21] <cancan101> i believe they have been working hard on improving it

[17:21] <cancan101> after all they want programmers to learn it rather than their competitors' equivalents

[17:21] <Dark_Shikari> equivalents?

[17:21] <Dark_Shikari> the equivalents are so far behind that... oh god =p

[17:22] <cancan101> eg Larrabee

[17:22] <cancan101> not sure

[17:22] <Dark_Shikari> opencl isn't even supported yet

[17:22] <Dark_Shikari> ATI's Closer to Metal failed

[17:22] <Dark_Shikari> larrabee isn't even out yet

[17:23] <cancan101> true

[17:23] <cancan101> ah well

[17:23] <cancan101> point was that they have been working on API

[17:23] <cancan101> and dev resources

[17:23] <Dark_Shikari> all of which help CUDA devs, which we have zero of ;)

[17:24] <Dark_Shikari> (cuda devs are welcome)

[17:24] <holger_> larrabee is probably going to be easy though once you get your algo properly parallelized. (well maybe you have to think a bit more about the execution order again)

[17:24] <Dark_Shikari> and you will need to write tons of 512-bit simd.

[17:24] <holger_> 512bit? i thought this was going to be multicore p54c with just wider simd? but that much wider?

[17:25] <Dark_Shikari> yes.

[17:25] <Dark_Shikari> that much wider.

[17:25] <holger_> hmm. interesting.

[17:26] <Dark_Shikari> well, if you got 4x4 satd to work in 128-bit SSE

[17:26] <Dark_Shikari> then clearly you can do 16x4 with 512-bit.
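
For readers unfamiliar with SATD (sum of absolute transformed differences), here is a scalar sketch in plain Python of the 4x4 case and of the 16x4 case treated as four independent 4x4 blocks side by side. This is an illustration only, not x264's SSE code: the function names `hadamard4`, `satd_4x4`, and `satd_16x4` are hypothetical, blocks are assumed to be lists of rows of integers, and the result is the unscaled sum (real implementations may normalize it).

```python
def hadamard4(v):
    """4-point Hadamard transform (unnormalized) using two butterfly stages."""
    a, b, c, d = v
    s0, s1, d0, d1 = a + b, c + d, a - b, c - d
    return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]


def satd_4x4(cur, ref):
    """SATD of two 4x4 blocks: transform the difference along rows, then
    along columns, and sum the absolute values of the coefficients."""
    diff = [[cur[y][x] - ref[y][x] for x in range(4)] for y in range(4)]
    rows = [hadamard4(r) for r in diff]
    cols = [hadamard4([rows[y][x] for y in range(4)]) for x in range(4)]
    return sum(abs(c) for col in cols for c in col)


def satd_16x4(cur, ref):
    """SATD of a 16-wide, 4-tall strip as the sum of four 4x4 SATDs.
    The four sub-blocks never interact, so the same arithmetic could be
    laid out across four times as many SIMD lanes."""
    return sum(
        satd_4x4([row[4 * i:4 * i + 4] for row in cur],
                 [row[4 * i:4 * i + 4] for row in ref])
        for i in range(4)
    )
```

The point of the 128-bit versus 512-bit remark is visible in `satd_16x4`: widening the vector by 4x maps onto processing four 4x4 blocks with the same sequence of adds, subtracts, and absolute values.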

[17:32] <Dark_Shikari> if you need a book that covers the basics of video coding, try: H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia

[17:32] <Dark_Shikari> it's a textbook, with a chapter on basic video coding and sections on H.264 and MPEG-4 Part 2

[17:32] <Dark_Shikari> ignore the latter completely, and at least read that chapter on basics

[17:33] <Dark_Shikari> H.264 is a decent read sometime, albeit long. unlike mpeg-4 part 2 it won't kill your brain