Communications with x264 Devs
[16:55] <cancan101> Hey
[16:55] <cancan101> I was wondering what the status of using CUDA to run x264 was?
[16:56] <Dark_Shikari> I haven't heard anything about it
[16:56] <cancan101> Were you working on it at some point?
[16:57] <Dark_Shikari> not really
[16:57] <cancan101> The motion estimation functions?
[16:57] <Dark_Shikari> there was a sort of feasibility study done a while back and, well, it didn't look very feasible unless someone very good wanted to donate many months of their time for no promise of any potential benefit.
[16:58] <holger_> that being the reason you didn't take the cuda gsoc application?
[16:58] <Dark_Shikari> every few weeks someone comes in saying they'll do it, and then doesn't actually do anything
[16:58] <Dark_Shikari> so I've been led to believe that nobody actually intends to do it
[16:58] <Dark_Shikari> and rather just thinks it would be a cool thing to talk about.
[16:58] <cancan101> Well it definitely seems cool to talk about
[16:58] <Yuvi> well it is a cool thing to talk about
[16:59] <rav0> what about me?
[16:59] <Yuvi> just not very practical
[16:59] <cancan101> but it also seems like there has been some commercial work on it
[16:59] <holger_> i think cuda is overrated atm. it has its uses, but they seem relatively limited right now.
[16:59] <Dark_Shikari> yes, lots of commercial work it seems
[16:59] <Dark_Shikari> but call me when that commercial work actually produces a usable product.
[17:00] <cancan101> So does badaboom not work (well)
[17:00] <cancan101> ?
[17:00] <Dark_Shikari> no
[17:00] <Dark_Shikari> it's slower than x264, thus completely useless
[17:00] <holger_> that is going to change when we get graphics cores on the cpu.
[17:00] <Dark_Shikari> you can beat x264 in three ways: quality, speed, and features
[17:00] <Dark_Shikari> quality, hell no, no chance
[17:01] <Dark_Shikari> features: no chance; while there are audiences x264 doesn't aim at that require even fancier things than x264 does, those encoders are far simpler
[17:01] <holger_> (cuda or something else sharing the same caches - now that could be something)
[17:01] <Dark_Shikari> speed: that's the only chance
[17:01] <Dark_Shikari> and it fails at that, too
[17:01] <Dark_Shikari> so, summary, it fails.
[17:01] <cancan101> Are there any commercial attempts that beat x264 at speed?
[17:01] <cancan101> other than badaboom
[17:02] <Dark_Shikari> well, speed is a relative term, speed relative to what
[17:02] <holger_> ati has something too. but they chose to publish numbers for coding hd video to 320x240 or something. who would want to do that?
[17:02] <Dark_Shikari> if you just mean raw encoding throughput, you could chain enough fpgas together to outperform x264, sure
[17:02] <Yuvi> youtube?
[17:02] <Dark_Shikari> but nothing actually practical.
[17:02] <Dark_Shikari> at least not practical if you're trying to max encoding throughput for a given cost
[17:03] <cancan101> Fair enough
[17:03] <Dark_Shikari> I don't know anything as fast as x264, or anywhere close
[17:03] <Dark_Shikari> now, one could make the claim that some other encoders might be better than x264 at some point on the speed-quality curve (if you take "Quality" as PSNR)
[17:03] <Dark_Shikari> and I think there are some points on the curve at which Mainconcept might win if you're solely measuring PSNR.
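[Editor's note: PSNR, the quality metric mentioned above, is simple to state. A minimal sketch in Python (the helper name `psnr` is hypothetical, not from x264):]

```python
import math

def psnr(ref, test, max_val=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(max_val ** 2 / mse)
```

[Higher is better; because PSNR ignores perceptual effects, an encoder tuned purely for it can "win" on this curve while producing output that looks worse.]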
[17:03] <cancan101> So in your opinion, and based on the feasibility study done in the past, it is entirely non-trivial to modify the x264 code to run in CUDA?
[17:04] <Dark_Shikari> of course, it requires completely rewriting all code that you want to port
[17:04] <cancan101> also, was the study done on CUDA ver 2.0?
[17:04] <Dark_Shikari> from scratch
[17:04] <Dark_Shikari> and changing all the algorithms completely
[17:04] <Dark_Shikari> because what is fast on CPU is not fast on GPU
[17:04] <cancan101> Right
[17:04] <holger_> also, what most people tend to overlook with cuda: cuda isn't exactly in a tight power envelope.
[17:04] <Dark_Shikari> now, we already know what algorithm we'd want to implement on CUDA, because it's the standard hardware algorithm designed to minimize bandwidth usage and avoid having to estimate in raster order
[17:05] <rav0> standards are designed to be broken
[17:07] <cancan101> And the study/previous work, was it done using CUDA ver 2.0?
[17:07] <Dark_Shikari> probably not at the time
[17:07] <Dark_Shikari> I don't think it existed
[17:07] <Dark_Shikari> but I doubt it changes anything
[17:07] <Dark_Shikari> you still need a CUDA expert willing to code for a few months
[17:08] <cancan101> Can you elaborate on: the standard hardware algorithm designed to minimize bandwidth usage and avoid having to estimate in raster order
[17:08] <Dark_Shikari> pyramidal search
[17:09] <Dark_Shikari> it's what badaboom uses, and odds are most fpga encoders do it as well
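[Editor's note: to make "pyramidal search" concrete, here is a toy coarse-to-fine motion search in Python. All names are hypothetical; real hardware/CUDA implementations run this per block in parallel, and badaboom's exact scheme is not public.]

```python
def downscale2x(img):
    """Average each 2x2 quad: builds the next pyramid level."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x+1]
              + img[2*y+1][2*x] + img[2*y+1][2*x+1] + 2) >> 2
             for x in range(w)] for y in range(h)]

def block_sad(cur, ref, bx, by, mx, my, bs):
    """Sum of absolute differences of a bs x bs block, edges clamped."""
    h, w = len(ref), len(ref[0])
    total = 0
    for y in range(bs):
        for x in range(bs):
            ry = min(max(by + my + y, 0), h - 1)
            rx = min(max(bx + mx + x, 0), w - 1)
            total += abs(cur[by + y][bx + x] - ref[ry][rx])
    return total

def pyramid_search(cur, ref, bx, by, bs=8, levels=3, coarse_range=4):
    """Exhaustive search at the coarsest level, then +/-1 refinement per level."""
    curs, refs = [cur], [ref]
    for _ in range(levels - 1):
        curs.append(downscale2x(curs[-1]))
        refs.append(downscale2x(refs[-1]))
    mv = (0, 0)
    for lvl in range(levels - 1, -1, -1):
        scale = 2 ** lvl
        cbx, cby, cbs = bx // scale, by // scale, max(bs // scale, 2)
        rng = coarse_range if lvl == levels - 1 else 1
        best = None
        for dy in range(-rng, rng + 1):
            for dx in range(-rng, rng + 1):
                cost = block_sad(curs[lvl], refs[lvl], cbx, cby,
                                 mv[0] + dx, mv[1] + dy, cbs)
                if best is None or cost < best[0]:
                    best = (cost, mv[0] + dx, mv[1] + dy)
        mv = (best[1], best[2])
        if lvl:
            mv = (mv[0] * 2, mv[1] * 2)  # carry the vector down to the finer level
    return mv
```

[Note that no block's search depends on any other block's result, unlike raster-order predictors; that independence is what makes the scheme map to GPUs and FPGAs and keeps bandwidth low.]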
[17:10] <Dopefish> i cut my cake into pieces using a pyramidal search
[17:10] <Dark_Shikari> mm, cake.
[17:11] <cancan101> And what is the background of that project as it relates to CUDA?
[17:12] <Dark_Shikari> which project?
[17:13] <cancan101> Implementing the algorithm that minimizes bandwidth usage and avoids having to estimate in raster order in CUDA
[17:13] <Dark_Shikari> whoever said anything about implementing a pyramidal search?
[17:13] <Dark_Shikari> as I said, it would require that rather hard to find cuda programmer ;)
[17:13] <Dopefish> and paying him
[17:14] <Dark_Shikari> well, moreso finding a volunteer
[17:14] <Dark_Shikari> Avail doesn't have it in the budget to pay one =p
[17:14] <Dark_Shikari> nehalems are Fast Enough(TM)
[17:14] <cancan101> Do you work for Avail?
[17:14] <Dark_Shikari> Sometimes.
[17:15] <cancan101> The reason I'm asking these questions is that my school has a CUDA class that involves a project
[17:15] * holger_ wonders if that would make for a phd thesis provided the diploma works out
[17:15] <cancan101> Though this project might be a bit too much for the level of the class
[17:15] <Dark_Shikari> oh dear, another person in a CUDA class who wants to do x264 in CUDA as a project
[17:15] <cancan101> Ah
[17:15] <cancan101> So i'm not the first
[17:15] <Dark_Shikari> If only we got them all together and had them actually do the work...
[17:16] <Dark_Shikari> Well, we're completely open to implementing such a thing
[17:16] <Dark_Shikari> even if it's very very simplified
[17:16] <Dark_Shikari> even if it doesn't support threads, multiref, B-frames, interlacing
[17:16] <Dark_Shikari> as long as it just works and is a starting point
[17:16] <Dark_Shikari> but this requires actually doing it, which so many people seem to not be very fond of.
[17:18] <holger_> and probably a decent amount of understanding the underlying hardware. i haven't looked at cuda recently, but i fear the api may be lacking.
[17:18] <Yuvi> that's one of the things that annoyed me when I did cuda/glsl stuff
[17:18] <Dark_Shikari> the algorithm itself is surprisingly simple and can be described in a few lines
[17:18] <Yuvi> documentation on what _is_ fast is rather lacking, and kinda handwaved
[17:18] <Dark_Shikari> if you don't do subpartitions and don't do subpel, it only requires three operations
[17:18] <Dark_Shikari> add
[17:19] <Dark_Shikari> absolute difference
[17:19] <holger_> sounds like nehalem atm. lots of educated guessing there, sometimes wrong.
[17:19] <Dark_Shikari> and 2x downscale interpolation
[17:19] <Dark_Shikari> the last of which is utterly trivial on a GPU
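[Editor's note: stripped of subpel and subpartitions, the three operations just listed fit in a couple of hypothetical one-liners, shown here only to illustrate how little ALU variety the search core needs.]

```python
def abs_diff_sum(a, b):
    """Absolute difference, then add: the entire SAD block-matching cost."""
    return sum(abs(x - y) for x, y in zip(a, b))

def downscale_row(row):
    """2x downscale interpolation in one dimension: average adjacent pairs."""
    return [(row[i] + row[i + 1] + 1) >> 1 for i in range(0, len(row) - 1, 2)]
```

[Applied horizontally and then vertically, `downscale_row` builds each pyramid level; a GPU runs thousands of these tiny kernels per frame in parallel.]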
[17:19] <Yuvi> with nehalem at least you have prior x86 cpus to go on as a base
[17:19] <Yuvi> from what I've read gpus vary wildly generation to generation
[17:20] <holger_> (my satd code has a case where using a "float" insn (which is actually int on conroe and penryn) is still faster on nehalem than doing it the way intel says.)
[17:21] <cancan101> As far as CUDA API goes
[17:21] <cancan101> I believe they have been working hard on improving it
[17:21] <cancan101> after all they want programmers to learn it rather than their competitors' equivalents
[17:21] <Dark_Shikari> equivalents?
[17:21] <Dark_Shikari> the equivalents are so far behind that... oh god =p
[17:22] <cancan101> eg Larrabee
[17:22] <cancan101> not sure
[17:22] <Dark_Shikari> opencl isn't even supported yet
[17:22] <Dark_Shikari> ATI's Closer to Metal failed
[17:22] <Dark_Shikari> larrabee isn't even out yet
[17:23] <cancan101> true
[17:23] <cancan101> ah well
[17:23] <cancan101> point was that they have been working on API
[17:23] <cancan101> and dev resources
[17:23] <Dark_Shikari> all of which help CUDA devs, which we have zero of ;)
[17:24] <Dark_Shikari> (cuda devs are welcome)
[17:24] <holger_> larrabee is probably going to be easy though once you get your algo properly parallelized. (well maybe you have to think a bit more about the execution order again)
[17:24] <Dark_Shikari> and you will need to write tons of 512-bit simd.
[17:24] <holger_> 512bit? i thought this was going to be multicore p54c with just wider simd? but that much wider?
[17:25] <Dark_Shikari> yes.
[17:25] <Dark_Shikari> that much wider.
[17:25] <holger_> hmm. interesting.
[17:26] <Dark_Shikari> well, if you got 4x4 satd to work in 128-bit SSE
[17:26] <Dark_Shikari> then clearly you can do 16x4 with 512-bit.
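[Editor's note: 4x4 SATD is a Hadamard transform of the residual followed by a sum of absolute values. A plain-Python sketch follows; x264's real version is hand-written SIMD asm and typically halves the result.]

```python
def hadamard4(v):
    """4-point Hadamard butterfly."""
    a, b, c, d = v
    s0, s1, s2, s3 = a + b, a - b, c + d, c - d
    return [s0 + s2, s1 + s3, s0 - s2, s1 - s3]

def satd4x4(cur, ref):
    """Transform the 4x4 residual by rows, then columns; sum absolute values."""
    d = [[cur[y][x] - ref[y][x] for x in range(4)] for y in range(4)]
    d = [hadamard4(row) for row in d]              # transform rows
    d = [hadamard4(list(col)) for col in zip(*d)]  # then columns
    return sum(abs(c) for row in d for c in row)
```

[Vectorizing this just means packing several 4-wide butterflies into one register, which is the 128-bit-SSE versus 512-bit point made above.]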
[17:32] <Dark_Shikari> if you need a book that covers the basics of video coding, try: H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia
[17:32] <Dark_Shikari> it's a textbook, with a chapter on basic video coding and sections on H.264 and MPEG-4 Part 2
[17:32] <Dark_Shikari> ignore the latter completely, and at least read that chapter on the basics
[17:33] <Dark_Shikari> H.264 is a decent read sometime, albeit long. unlike mpeg-4 part 2 it won't kill your brain