Brent's Patches to the Ruby interpreter

At the end of this page you will find a number of 'C' language patches for the Ruby MRI.

The current release is MBARI 8 on MRI Ruby 1.8.7 patchlevel 352 -- Oct. 12, 2011

Thank the visionary folks at Engine Yard for sponsoring me to produce a parallel version of these patches for Ruby 1.8.6!

I've also posted the (up-to-date) original patches against 1.6.8 just for old times' sake.

NOTE: These patches are now hosted on the brentr fork of matzruby on GitHub. See the Wiki for the latest instructions.

The most recent patches available on this page apply to MRI Ruby 1.8.7 patchlevel 72

As noted below, some of these patches have been incorporated into the official MRI Ruby 1.8.7 tree after patchlevel 72.

They have also been incorporated, largely intact, into Enterprise Ruby.

I developed these patches over the course of five years working with Ruby at the Monterey Bay Aquarium Research Institute on a robotic underwater microbiology laboratory. When I started, Ruby 1.6.8 was the latest stable version.

Aside from bug fixes, the primary goal of these patches is to reduce the memory consumption of the 1.8 series Ruby interpreters. Happily, these same techniques tend also to increase the speed of most applications, but speed increase was not my primary concern.

Each of the six patches below (mbari1-6) fixes a specific problem with or optimizes some facet of the Ruby interpreter. The patches were intended to be applied in order, starting with official interpreter release 1.8.7-patchlevel72 from ruby-lang.org. However, you may be able to apply only a subset of them if you don't want a particular feature or optimization.

Please post any bugs, flames, benchmark results, requests for improvement, etc. to the ruby-core mailing list by replying to the message dated 12/20/08 titled:

[PATCH] Promising C coding techniques to reduce MRI's memory use

Recommended procedure for applying all patches to ruby-1.8.7-p72 (works on *nix hosts):

Recommended procedure for applying patches to ruby-1.8.6-p287:

    1. Download and unpack the official MRI 1.8.6-patchlevel287 release tarball

    2. cd into its ruby-1.8.6-p287 subdirectory

    3. download the attached ruby-1.8.6-mbari8B.patch file into that directory

    4. apply patch with: patch -p1 < ruby-1.8.6-mbari8B.patch

Recommended procedure for applying patches to latest ruby-1.6.8 snapshot:

    1. Download and unpack the latest official MRI 1.6.8 snapshot tarball

    2. cd into its ruby subdirectory

    3. download the attached ruby-1.6.8-mbari8B.patch file into that directory

    4. apply patch with: patch -p1 < ruby-1.6.8-mbari8B.patch

Procedure for compiling any version with appropriate gcc options:

  1. ./configure CFLAGS="-O2 -fno-stack-protector"

  2. make

  3. ./ruby -v #should display "{version} MBARI 8B/0x**** on patchlevel xx"

  4. sudo make install #should install the patched Ruby under /usr/local/bin

**** will be replaced by the STACK_WIPE_SITES setting documented below. It varies depending on CPU type.

See below for more details, including some rudimentary tests.

CFLAGS notes:

Note that some older versions of gcc did not support (or need) the -fno-stack-protector option.

It should be omitted from CFLAGS only if your version of gcc does not support it.

Compiling without -fno-stack-protector will considerably slow the stack clearing of the MBARI4 and MBARI7 patches.

Adding -fomit-frame-pointer can increase speed by about 7% on an x86, but the resulting binaries are difficult to use in gdb.

Detailed Description of each MBARI Patch:

MBARI1.patch:

Multi-threaded Ruby programs using Continuations may fail with a segmentation fault if a thread for which there exists one or more continuations dies. The marking operation for such Continuations was incorrectly trying to traverse the stack frames of the dead thread. The MBARI1 patch avoids these segfaults and saves memory by freeing all stacks in Continuations that refer to dead threads before the next marking operation occurs. It also adds one new core method: Continuation#thread returns the active thread to which the continuation refers or nil if that thread has died.

The attached file "contDead.rb" will quickly cause an unpatched interpreter to segfault. Once patched, it runs forever in a tight loop as intended.

The MBARI1 patch was integrated into the MRI Ruby 1.8 sometime after 1.8.7-patchlevel 72.

MBARI2.patch:

The Ruby execution stack in multi-threaded programs includes the frames of all threads. This is a side-effect of the "green threads" stack management technique used by the 1.6 and 1.8 series interpreters. On the surface, being able to see the parent thread's stack frames on child threads seems like it might be a feature, or at least that it could be viewed as a benign bug.

However, if the parent thread dies and the child continues, some of the references on the (dead) parent thread's stack may no longer be valid. This causes very occasional segfaults during the marking phases of subsequent garbage collections. Even if a dead thread's stack frames do not cause segfaults in your applications, they prevent the garbage collector from ever freeing any of the objects to which they refer. This can waste many megabytes of memory.

The MBARI2 patch ensures that the stack of any thread includes only frames it created. It also exploits that new constraint to significantly reduce the amount of memory copied during context switches by simply omitting the frames belonging to threads other than the one of interest.

MBARI3.patch:

Ruby's conservative garbage collector cannot tell whether machine words on the 'C' stack are object pointers or integers, etc. because there is no type information associated with them. A conservative collector works by "conserving" every object to which there could possibly be a reference. In the 1.8 and 1.6 series Ruby implementations, this means scanning the stack of each Thread and Continuation, assuming that every word is an object pointer if it has a value that could be so interpreted. In practice, this is not as bad as it may seem, as Ruby's collector does not consider pointers "inside" an object to be valid -- only those that point to its exact base address. So, even assuming thousands of live objects, a 32-bit address space will remain very sparsely populated with valid object pointers.

The garbage collector's leaking memory is not really its own fault. The trouble is that the 'C' machine stack is filled with object references. The main reason for this is that gcc compilers create overly large stack frames and do not initialize many values in them. Certain 'C' constructs used in the Ruby interpreter's core recursive expression evaluator generate especially large, sparse stack frames. The function rb_eval() is the worst offender, creating kilobyte-sized stack frames for each invocation of a function that may call itself hundreds of times. This results in stacks that are hundreds of kilobytes, often full of old, dead object references that may never go away. If there were a gcc compiler option to initialize all local variables to zero whenever a new stack frame is built, that would let the collector do its work properly, but no such option exists.

The MBARI3 patch tracks the maximum depth of the 'C' stack at critical points in the interpreter's execution so that, whenever the stack shrinks, its old contents can be replaced with zeros. One might think this would have no effect because each new stack frame would simply overwrite all the zeros. But, as noted in the previous paragraph, gcc does not work that way; it optimizes for time by writing only those words in each frame actually needed for a specific execution path through the function. If the remaining words are left uninitialized, when the garbage collector runs, the collector interprets some of them as valid object pointers that must be preserved, even though these "ghost values" were written by previous invocations of functions that are no longer active. This patch ensures that almost all those hitherto uninitialized values are zero when the collector runs.
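The effect can be pictured with a toy model. The Ruby sketch below is not code from the patch; ToyStack, its methods and the values 101 and 202 are invented for illustration. It treats the stack as an array of words, lets each "call" write only some of its slots (as gcc-generated frames do), and compares what a conservative scan sees with and without zeroing the region freed when the stack shrinks:

```ruby
# Toy model of a machine stack; integers stand in for object pointers.
class ToyStack
  def initialize(size)
    @words  = Array.new(size, 0)
    @sp     = 0   # index of the next free slot
    @max_sp = 0   # deepest extent reached since the last wipe
  end

  # Reserve a frame of `size` slots but write only the listed slots
  # (offset => value), leaving the rest of the frame untouched.
  def push_frame(size, writes)
    base = @sp
    writes.each { |offset, value| @words[base + offset] = value }
    @sp += size
    @max_sp = @sp if @sp > @max_sp
  end

  def pop_frame(size)
    @sp -= size
  end

  # Zero everything between the current top and the deepest extent seen.
  def wipe
    (@sp...@max_sp).each { |i| @words[i] = 0 }
    @max_sp = @sp
  end

  # Conservative scan: every nonzero word in the live region counts as a root.
  def roots
    @words[0...@sp].reject { |word| word == 0 }
  end
end

def ghost_demo(wipe_after_pop)
  stack = ToyStack.new(8)
  stack.push_frame(4, { 1 => 101 })  # first call stores "pointer" 101
  stack.pop_frame(4)                 # it returns; 101 is still in memory
  stack.wipe if wipe_after_pop
  stack.push_frame(4, { 0 => 202 })  # next call writes slot 0 only
  stack.roots
end

p ghost_demo(false)  # the stale 101 is still reported as a root
p ghost_demo(true)   # only 202 remains
```

In this model the second frame never touches slot 1, so without the wipe the scan keeps "object" 101 alive indefinitely.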

In practice, for MBARI's ESP application, I observed that the process size increased slowly without bound over many hours of operation. After patching, the process size stayed below 10MB for days. Before, I had seen it increase to >24Mbytes. (Our ARM targets have only 32MB of DRAM!) Your mileage may vary, but I expect every long running application will see significant reduction in process size although some may run slightly slower due to the overhead of clearing the stack.

Try this little one-liner before and after the MBARI3 patch is applied:

$ ruby -e "loop {@x = callcc {|c| c}}"

Monitor the process size while it runs the loop. Without the patch, the ruby process quickly grows to consume all memory. On Ruby version 1.6.8 it will eventually segfault from a stack overflow caused by runaway recursion in the garbage collector. With the patch, the process size should stay between 10 and 20MBytes.

The MBARI3 patch adds these methods to the GC module:

GC.growth #returns the number of bytes allocated since the last garbage collection

GC.limit #returns the maximum allowed GC.growth

GC.limit= bytes # sets the maximum allowed GC.growth

This patch also removes the code newly introduced in the 1.8 series garbage collector that attempted to dynamically increase the GC.limit when allocating large objects. It also eliminates the GC.stress and GC.stress= methods, because GC.stress=true is equivalent to GC.limit=0.
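A script that wants to report these counters without breaking on an unpatched interpreter can test for them first. The following is a minimal sketch; the helper name mbari_gc_status is made up here, and only GC.growth and GC.limit come from the patch:

```ruby
# Returns the MBARI GC counters as a hash, or nil when the interpreter
# does not provide the MBARI API (hypothetical helper, for illustration).
def mbari_gc_status
  return nil unless GC.respond_to?(:growth) && GC.respond_to?(:limit)
  { :growth => GC.growth, :limit => GC.limit }
end

status = mbari_gc_status
if status
  puts "allocated since last GC: #{status[:growth]} of #{status[:limit]} bytes"
else
  puts "MBARI GC API not available in this interpreter"
end
```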

WARNING: this patch relies on having a working version of alloca(). The Microsoft 'C' library documentation indicates that it should work fine, but I have not tested it there.

MBARI4.patch:

As mentioned above, the 'C' code for the rb_eval() function in Ruby's expression evaluator is written such that gcc creates a very large (circa 1 kByte) stack frame for it. Compiling with gcc's -mpreferred-stack-boundary=2 option helps only slightly in reducing the size of these frames. Since rb_eval() is highly recursive, it is not uncommon for the size of Ruby 'C' stacks to exceed 100 kBytes. Consider that switching threads or calling a continuation involves copying Ruby's 'C' stacks. It is little wonder that these operations tend to be slow when using the 1.6 and 1.8 series Ruby interpreters.

The rb_eval() function allocates this large a stack frame because it is written as a single, huge switch statement where each case processes a particular type of node in the parse tree being interpreted. There are about 100 of these cases. For any given invocation of rb_eval(), only one or two of them are even executed. Nonetheless, gcc allocates temporaries and explicitly declared local variables for the 90 or so unexecuted cases even though these remain uninitialized. Some have observed that gcc could do a better job of allowing disjoint execution paths to share "stack slots", but this seems unlikely to happen in the near future.

The MBARI4 patch factors every block of code inlined in rb_eval() into a separate static (non-inlined!) function. After being factored this way, rb_eval()'s stack frame size is reduced to about 40 bytes. Of course, certain types of nodes in the parse tree will call a factored function that allocates a couple hundred or so bytes of temporaries on the stack, but that space will be allocated only when it is actually needed. In practice, for the MBARI ESP application, I observe the stack size for each thread being reduced to about one quarter or less of its pre-patched size.

Conventional optimization wisdom predicts that the original monolithic rb_eval() would run faster than this patched, factored version. However, the Ruby interpreter is not a conventional 'C' program. It routinely copies and scans large sections of its own stack. These operations are faster when the stack is kept smaller, which may more than make up for the increased overhead of creating more (smaller) stack frames and clearing the stack.

The attached tests "bogus1.rb" and "bogus2.rb" are contrived to demonstrate how significant the speed improvement can be when working with multiple threads having greatly differing stack sizes. On my 1.6GHz CoreDuo Mac mini running Linux 2.6, the bogus1 test runs in 26 seconds unpatched compared to about 1 second with these patches. The bogus2 test runs in about 35 seconds unpatched, less than 3 seconds patched. These are contrived, extreme examples. But all Ruby programs that employ Continuations or multiple threads will benefit to some extent.

MBARI5.patch:

The 'C' setjmp and longjmp functions form the basis for all the thread, continuation and exception handling in the 1.6 and 1.8 Ruby interpreters. These functions were intended to allow jumps into active frames higher up the call stack. This is fine for exception handling, but not enough for thread and continuation context switches. These involve replacing the active call stack with another one, which may require extending the stack to make it larger. The only platform independent way to extend the call stack is to call functions that push new frames onto it.

The (pseudo-)code to extend the stack looked something like this before this patch:

    stack_extend()
    {
        volatile VALUE space[1024];

        if (stack still too small) stack_extend();
        swap_threads...
    }

This is recursing just to add the necessary bytes to the stack. It's actually an O(n/1024) algorithm.

If you have many threads with widely varying stack sizes (not unusual at all), you will waste some time here.

The MBARI5 patch replaces it with:

    stack_extend()
    {
        volatile VALUE *space = alloca(# of bytes required to grow stack);
        swap_threads...
    }

This is O(1), as it should be.
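To put rough numbers on the difference, the Ruby sketch below counts how many nested calls each scheme needs to grow the stack by a given amount. The helper names and the 4-byte VALUE default are assumptions made for illustration, and any per-call overhead beyond the space[] array is ignored:

```ruby
# Nested stack_extend() calls needed by the recursive scheme to grow the
# stack by `bytes`, when each call reserves 1024 VALUE-sized words.
def recursive_extend_calls(bytes, value_size = 4)
  frame_bytes = 1024 * value_size
  (bytes + frame_bytes - 1) / frame_bytes   # integer ceiling division
end

# The alloca-based scheme reserves the whole amount in a single call.
def alloca_extend_calls(bytes)
  bytes > 0 ? 1 : 0
end

growth = 100 * 1024   # a 100 kByte stack, as mentioned for rb_eval()
puts "recursive: #{recursive_extend_calls(growth)} calls"
puts "alloca:    #{alloca_extend_calls(growth)} call"
```

Under these assumptions, restoring a 100 kByte stack costs 25 nested calls in the recursive version and one call with alloca.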

For platforms that don't support alloca, the original recursive version of stack_extend() is used.

If your platform has a poorly implemented alloca, replace #if HAVE_ALLOCA with #if 0 in the function rb_thread_restore_context() to force use of a streamlined recursive version.

Note that, for recent versions of gcc, one should always compile with the -fno-stack-protector option to prevent it from emitting time-wasting code to check the validity of the stack pointer after each alloca().

An improved version of the MBARI5 patch was integrated into the MRI Ruby 1.8 sometime after 1.8.7-patchlevel 72.

Therefore, the patchset on github for Ruby 1.8.7 patchlevel 352 omits MBARI5.

MBARI6.patch:

This patch adds very simple methods for determining where in the source text a Method or Proc object was originally defined. After patching, instances of the Method, UnboundMethod and Proc classes will respond to __file__ and __line__ methods. __file__ returns the name of the source file containing the code, while __line__ returns the line number at which the definition starts. This is analogous to Ruby's __FILE__ and __LINE__ keywords. Note that these methods raise ArgumentError if the Method or Proc was not defined in Ruby.

Readers familiar with the latest version 1.9 developments may be pleased to note that:

    class Proc
      def source_location
        [__file__, __line__]
      rescue ArgumentError
        nil
      end
    end

works as expected. And, this same definition may be repeated to support Method and UnboundMethod classes.
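One way to avoid repeating the definition by hand is to loop over the three classes. This is only a sketch: it assumes the MBARI6 methods __file__ and __line__ are present, and it leaves alone any class that already provides its own source_location (as Ruby 1.9 and later do), so on such interpreters it does nothing.

```ruby
[Proc, Method, UnboundMethod].each do |klass|
  # Leave a native source_location (Ruby 1.9 and later) untouched.
  next if klass.method_defined?(:source_location)
  # Without the MBARI6 methods there is nothing to build on.
  next unless klass.method_defined?(:__file__) && klass.method_defined?(:__line__)

  klass.class_eval do
    def source_location
      [__file__, __line__]
    rescue ArgumentError
      nil   # the Method or Proc was not defined in Ruby source
    end
  end
end

# Example use: prints the location where available, nil otherwise.
greeter = proc { |name| "hello #{name}" }
p(greeter.respond_to?(:source_location) ? greeter.source_location : nil)
```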

The attached file sourceref.rb demonstrates some of the useful things that can be done with these new methods.

MBARI7.patch:

MBARI6 plugs memory leaks and runs multi-threaded applications much faster; however, it runs typical small benchmarks 3% to 12% slower than unpatched Ruby. MBARI7 provides detailed build-time configuration control over when stack clearing is done and optimizes the GC, so that MBARI7 can be as fast as unpatched Ruby even for small, single-threaded benchmarks, while still effectively clearing ghost object references off the stack. This patch also fixes a couple of benign bugs in MBARI3.

The #define STACK_WIPE_SITES is a bit mask that controls when and how the stack is "wiped" as follows: (excerpted from rubysig.h)

0x*001 --> wipe stack just after every thread_switch

0x*002 --> wipe stack just after every EXEC_TAG()

0x*004 --> wipe stack in CHECK_INTS

0x*010 --> wipe stack in while & until loops

0x*020 --> wipe stack before yield() in iterators and outside eval.c

0x*040 --> wipe stack on catch and thread save context

0x*100 --> update stack extent on each object allocation

0x*200 --> update stack extent on each object reallocation

0x*400 --> update stack extent during GC marking passes

0x*800 --> update stack extent on each throw (use with 0x040)

0x1000 --> use inline assembly code for x86, PowerPC, or ARM CPUs

0x0*** --> do not even call rb_wipe_stack()

0x2*** --> call dummy rb_wipe_stack() (for debugging and profiling)

0x4*** --> safe, portable stack clearing in memory allocated with alloca

0x6*** --> use faster, but less safe stack clearing in unallocated stack

0x8*** --> use faster, but less safe stack clearing (with inline code)

for most effective gc use 0x*707
for fastest micro-benchmarking use 0x0000
0x*770 prevents almost all memory leaks caused by ghost references
without adding much overhead for stack clearing.
Other good trade offs are 0x*270, 0x*703, 0x*303 or even 0x*03

In general, you may lessen the default -mpreferred-stack-boundary
only if using less safe stack clearing (0x6***). Lessening the
stack alignment with portable stack clearing (0x4***) may fail to clear
all ghost references off the stack.

When using 0x6*** or 0x8***, the compiler could insert
stack push(s) between reading the stack pointer and clearing
the ghost references. The register(s) pushed will be
cleared by the rb_gc_stack_wipe(), typically resulting in a segfault
or an interpreter hang.

STACK_WIPE_SITES of 0x8770 works well compiled with gcc on most machines
using the recommended CFLAGS="-O2 -fno-stack-protector". However...
If it hangs or crashes for you, try changing STACK_WIPE_SITES to 0x4770
and please report your details. i.e. CFLAGS, compiler, version, CPU
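When comparing builds it can help to spell out what a given mask selects, for instance the value reported by ruby -v. The Ruby sketch below decodes a mask using the table above. It is an illustration, not part of the patch: the names are invented, and it assumes that 0x1000 is an independent flag while the three highest bits (0xE000) select the clearing method.

```ruby
WIPE_SITE_BITS = {
  0x001 => "wipe after every thread_switch",
  0x002 => "wipe after every EXEC_TAG()",
  0x004 => "wipe in CHECK_INTS",
  0x010 => "wipe in while & until loops",
  0x020 => "wipe before yield() in iterators and outside eval.c",
  0x040 => "wipe on catch and thread save context",
  0x100 => "update stack extent on each object allocation",
  0x200 => "update stack extent on each object reallocation",
  0x400 => "update stack extent during GC marking passes",
  0x800 => "update stack extent on each throw"
}

WIPE_METHODS = {
  0x0000 => "do not call rb_wipe_stack()",
  0x2000 => "call dummy rb_wipe_stack()",
  0x4000 => "safe, portable clearing in memory allocated with alloca",
  0x6000 => "faster, less safe clearing in unallocated stack",
  0x8000 => "faster, less safe clearing with inline code"
}

# Break a STACK_WIPE_SITES value into its sites, asm flag and method.
# Assumption: bits 0xE000 pick the method; 0x1000 is a separate flag.
def decode_wipe_sites(mask)
  bits = WIPE_SITE_BITS.keys.sort.select { |bit| (mask & bit) != 0 }
  {
    :sites      => bits.map { |bit| WIPE_SITE_BITS[bit] },
    :inline_asm => (mask & 0x1000) != 0,
    :method     => WIPE_METHODS[mask & 0xE000]
  }
end

decoded = decode_wipe_sites(0x8770)
puts decoded[:method]
decoded[:sites].each { |site| puts "  #{site}" }
```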

Clearing the stack once in each looping and iterating construct and during context switches should be sufficient to prevent ghost references from linking between stack frames. This linking appears to be what causes loop {@x = callcc {|c| c}} and many large or multithreaded apps to leak badly.

Whenever possible, updating stack extent during garbage collection passes now causes a separate stack area to be used that cannot contaminate the main interpreter stack with ghost references after garbage_collect() returns. This technique is very quick and effective, so it is now enabled by default. Note that the virtual memory address space for the interpreter will increase by 1 - 8 MB as a result. However, the "hole" between the two stacks is never accessed, so this address space does not increase the physical memory required.

The MBARI7 patch adds this method to the GC module:

GC.exorcise #Purge ghost references from recently freed stack space

If you have STACK_WIPE_SITES defined such that the required automatic stack clearing has been disabled, you may see the same sort of leaks that plague the unpatched Ruby. In this case, try invoking GC.exorcise at critical points to eliminate the leak. Then consider how better to #define STACK_WIPE_SITES. GC.exorcise is primarily a debugging tool. Please do let me know if you come across a script that leaks with STACK_WIPE_SITES set to its default value of 0x*770 but does not leak when it is set to 0x*707.

On my 1.6GHz CoreDuo Mac mini, MBARI7 runs the standard Ruby test suite, producing exactly the same output as the unpatched ruby-1.8.7-p72.

This OpenOffice spreadsheet and HTML version depict the run time for these ruby performance tests with various STACK_WIPE_SITES.

MBARI8.patch:

Bug fixes, gc tuning, and new configure flags!

After a number of long, tedious debugging sessions to determine why MBARI7, while very solid on x86-32, was segfaulting on x86-64 CPUs, I discovered two latent bugs that had been exacerbated by my factoring of rb_eval() into separate eval functions for each node type (in MBARI4). The bugs were the stuff of every programmer's worst nightmare in that they:

    1. were very difficult to reproduce reliably,

    2. resulted in random segfaults or test failures,

    3. did not occur in unpatched Ruby,

    4. did not occur on ARM or x86-32 CPUs,

    5. would occur only after running for long, variable periods -- often many hours,

    6. disappeared entirely whenever gcc's optimizer was disabled,

    7. disappeared if the unfactored rb_eval() was substituted in place of the factored one.

At the time I suspected all the bugs were the result of one or two errors in factoring rb_eval(). In the end, it turned out to be much more "interesting" than that:

    1. The function evaluating a string literal node would, in some cases, pass the internal pointer of a newly created ruby string object into rb_reg_new(), which would derive a new regular expression object from it. Trouble was, gcc, when optimizing on a machine like the x86-64, would determine that the pointer to the newly created string object need not be stacked and in fact could be entirely "forgotten" as soon as its text buffer was passed into rb_reg_new(). Nothing wrong with that... unless a GC pass happened to be triggered while deriving the new regular expression from that string object's internal text buffer. In that case, the string object would never be marked and, as a result, that string object and its text buffer would be prematurely freed, trashing the regular expression pattern and causing very occasional regex match failures and (very rare) heap corruption.

    2. eval.c is full of setjmp() and longjmp() calls. These are tricky and error-prone constructs for a number of reasons. The most insidious is that the 'C' spec does not require that non-volatile local variables in the function containing a setjmp() be preserved when it returns via longjmp(). A few of the unpatched functions containing EXEC_TAG() in eval.c missed this point, failing to declare as 'volatile' variables that might need to be preserved on the stack in exceptional cases (redo clauses in some contexts, for example). And, many variables were declared volatile that did not need to be, adding to the confusion. This coding rule is difficult to maintain during incremental development, even if one follows it properly in the first place. I also added a few such errors of my own with MBARI4, because I did not fully understand the volatile qualifier in this context, but tried to follow the "patterns" I saw in the existing code. The volatile qualifier here prevents variables from being cached in registers. So, of course, CPUs with large register files would be most susceptible to this class of bug.

MBARI8 defines the default GC.limit as 2 million * sizeof(VALUE). Unpatched, MRI sets the default GC malloc limit to 8 million bytes. This penalized machines that used 64-bit VALUEs, which is somewhat ironic given that they would likely have more physical memory.
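The arithmetic is easy to check. In the Ruby sketch below (the helper name is invented for illustration), the MBARI8 default equals the unpatched 8 million bytes when a VALUE is 4 bytes and doubles when a VALUE is 8 bytes:

```ruby
# Default GC.limit under MBARI8, in bytes, for a given sizeof(VALUE).
def mbari8_default_gc_limit(value_size)
  2_000_000 * value_size
end

unpatched_limit = 8_000_000   # bytes, the same for every word size

[4, 8].each do |value_size|
  limit = mbari8_default_gc_limit(value_size)
  puts "sizeof(VALUE)=#{value_size}: MBARI8 #{limit} bytes, unpatched #{unpatched_limit} bytes"
end
```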

Two new flags were added to the configure shell script so that it is no longer necessary to patch the 'C' code to enable the MBARI_API or to fiddle with the STACK_WIPE_SITES mask. Running

$ ./configure --help

will document these. Here's a usage example:

$ ./configure CFLAGS="-O2 -fno-stack-protector" --enable-mbari-api --with-wipe-sites=0x0

This enables the MBARI patches' core API mods (GC.limit, Continuation#thread, etc.) and disables all stack clearing.

Some of the fixes in MBARI8 were integrated into the MRI Ruby 1.8 sometime after 1.8.7-patchlevel 72; most were not.

Do continue to improve these patches by reporting any problems with them.

However, please do not report any failures to me that also occur when running the corresponding unpatched version of Ruby.

At this point, I'd be quite interested to learn of anyone's experiences, positive or negative, with these patches running under MS-Windows or Cygwin.