
Release notes for SmoothD2 version a3:
Fixed a bug in the build system for SmoothD2 which left SmoothD2 dependent on the very common system DLL "MSVCR80.DLL".  If MSVCR80.DLL was not present on the user's machine, or was an incompatible version, SmoothD2 would not load.  The build now links the required MSVCR80.DLL code into SmoothD2 itself instead of loading it from MSVCR80.DLL at runtime.

Fixed a bug in smoothD2() that would cause smoothD2c() to fail if the input width was not mod 8 and a downsize > 1 was specified.
  smoothD2()  requires the input clip to be mod 2 in both height and width.  This is the normal AviSynth requirement for YV12 video.
  smoothD2c() requires the input clip to be mod 4 in both height and width.
              Mod 4 is required in smoothD2c() because YV12 stores the chroma at 1/2 the resolution of the luma.
              If the luma is mod 4, then the extracted chroma, which is passed to smoothD2(), is mod 2.
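
The mod 4 reasoning above can be checked with a few lines of arithmetic.  This is only an illustrative Python sketch, not code from the plugin; the helper names chroma_size, meets_smoothd2 and meets_smoothd2c are made up for this example.

```python
def chroma_size(width, height):
    """YV12 stores each chroma plane at half the luma width and height."""
    return width // 2, height // 2

def meets_smoothd2(width, height):
    # smoothD2(): the clip must be mod 2 in both dimensions
    return width % 2 == 0 and height % 2 == 0

def meets_smoothd2c(width, height):
    # smoothD2c(): the clip must be mod 4, so the half-size chroma is still mod 2
    return width % 4 == 0 and height % 4 == 0

for w, h in [(720, 480), (718, 480)]:
    cw, ch = chroma_size(w, h)
    print(w, h, "mod 4:", meets_smoothd2c(w, h),
          "chroma:", (cw, ch), "chroma mod 2:", meets_smoothd2(cw, ch))
```

A 718x480 clip is mod 2, so it is acceptable to smoothD2(), but its chroma planes are 359 wide, which is why smoothD2c() needs the stricter mod 4 input.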

Improved edge of frame deblocking:  
  The blocks at the edge of the frame are mirrored and slightly smoothed.  This provides the best data available for the deblock algorithm to use when the algorithm shifts a block outside the edge of the frame.
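
As a rough illustration of the mirror-and-smooth idea, here is a Python sketch on a single row of pixels.  The exact mirroring and smoothing used inside SmoothD2 are not described here, so the symmetric mirror and the [1, 2, 1] / 4 blur below are assumptions, and the function names are hypothetical.

```python
def mirror_pad_row(row, pad):
    """Extend a row by `pad` samples on each side, mirrored about the frame edge."""
    if pad > len(row):
        raise ValueError("pad must not exceed the row length")
    left = row[:pad][::-1]                  # first `pad` pixels, reversed
    right = row[len(row) - pad:][::-1]      # last `pad` pixels, reversed
    return left + row + right

def smooth_padding(padded, pad):
    """Lightly blur only the mirrored samples (assumed [1, 2, 1] / 4 kernel)."""
    out = list(padded)
    n = len(padded)
    for i in list(range(pad)) + list(range(n - pad, n)):
        lo = padded[max(i - 1, 0)]          # clamp at the ends of the work row
        hi = padded[min(i + 1, n - 1)]
        out[i] = (lo + 2 * padded[i] + hi + 2) // 4   # rounded integer average
    return out

row = [10, 20, 30, 40]
padded = mirror_pad_row(row, 2)
print(padded)
print(smooth_padding(padded, 2))
```

The pixels inside the original frame are left untouched; only the samples outside the edge are synthesized, so a shifted block always has plausible data to read.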

Removed restrictions on what matrices can be used with Qtype 3.
Removed restrictions on the minimum Quant values that can be used with Qtype 1.

New argument "Cpr" available in smoothD2c().  Provides control over chroma desaturation at low luma levels.
  Cpr, Chroma Protection strength, is used to reduce "chroma vampire", Schwartzvald's perfect phrase describing the color washout that occurs at high amounts of color smoothing.
  This also provides an example use of the ZWmask argument.

New argument "ncpu" available in both smoothD2() and smoothD2c(), allowing the use of multiple cpu cores if available.

Major Speed improvements 
1: Restructured the main processing loop.  Previously, for each shift of the frame, it did
     tmpFrame = processed shifted value
     accumulator += tmpFrame
   Now it does
     accumulator += processed shifted value
   This saves 64 frame copies for num_shift=4.
2: Converted all remaining C routines in the main processing loop to xmm (sse instructions).
3: Modified the weighted average computation, which produces the output pixel values, to use right shifts instead of divisions when appropriate.  As a result, speed tests are given both for the best case, where only shifts are used, and for the worst case, where only divisions are used.
4: Added multithreading of main processing loop, allowing up to 4 CPUs to be used.
   The multithreading code path is completely bypassed if only 1 CPU is used (the default case).
5: As part of the edge of frame smoothing, the internal work frame is created with 16 byte alignment and mod 16 width.  This maximizes the speed of the xmm routines.
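
Items 1 and 3 above can be sketched in Python as follows.  This only illustrates the idea and is not the plugin's xmm code: process_shifted is a hypothetical stand-in for the real per-shift processing, and treating "when appropriate" as "the total weight is a power of two" is an assumption.

```python
def process_shifted(frame, shift):
    """Hypothetical stand-in: yields the processed value of each shifted pixel."""
    for p in frame:
        yield p + shift        # placeholder arithmetic, not the real filter

def accumulate(frame, shifts):
    acc = [0] * len(frame)
    for s in shifts:
        # Item 1: add each processed value straight into the accumulator,
        # without first storing the whole result in a temporary frame.
        for i, v in enumerate(process_shifted(frame, s)):
            acc[i] += v
    return acc

def normalize(acc, total_weight):
    """Item 3: use a right shift when the divisor is a power of two."""
    if total_weight > 0 and (total_weight & (total_weight - 1)) == 0:
        k = total_weight.bit_length() - 1      # total_weight == 1 << k
        return [v >> k for v in acc]           # best case: shifts only
    return [v // total_weight for v in acc]    # worst case: divisions

frame = [10, 20, 30]
print(normalize(accumulate(frame, [0, 1, 2, 3]), 4))   # shift path
print(normalize(accumulate(frame, [0, 1, 2]), 3))      # division path
```

Both paths give the same result as an integer division; the shift path simply avoids the slower divide instruction.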

Jim Conklin,
Sep 9, 2012, 11:12 PM