Improving Garbage Collection of Ruby Fibers

Project Goal

The goal of the project was to improve garbage collection of inactive Ruby fibers, which are a lightweight concurrency mechanism similar to threads, implemented as stackful coroutines.

Background Information

Each Ruby fiber comes with its own 4 kB stack, and its execution can be suspended and resumed. Under Ruby's mark and sweep garbage collection (GC), all objects reachable from a root set are marked, and the remaining unmarked objects are considered garbage and swept. The root set includes registers, global variables, and any object references present on the execution context's stack.

Ruby's GC has incorporated techniques such as generational GC to shorten GC times, but it currently has no way of determining fiber workload and adapting based on that. A suspended fiber's stack is not modified while it is inactive, so the values on its stack will not be changing, and rescanning its stack on every GC may be unnecessary. Because fibers are more lightweight than threads, many more fibers can be created, so GC on inactive fibers may result in significant pause times under Ruby's stop-the-world GC.

What I Worked On

Benchmark

One of the first things I did was create a benchmark that could act as a sort of stress test for evaluating GC. I started by working off a benchmark that my mentor had created, which essentially creates a lot of fibers, and a lot of objects on the stacks of those fibers, shown below:

def recurse(n)
  # Create an object on the stack to make the GC do a little work:
  object = Object.new

  if n == 0
    Fiber.yield
  else
    recurse(n - 1)
  end
end

def make_fibers(n = 1000, m = 1000)
  n.times.map do
    Fiber.new{recurse(m)}.tap(&:resume)
  end
end


def benchmark
  fiber_full_stack_scan = Array.new
  thread_full_stack_scan = Array.new
  stack_scan_bytes = Array.new

  15.times do |i|
    count = 2**i
    puts "Count = #{count}"
    fibers = make_fibers(count)

    # The first time we do this, we expect all fibers to be fresh, and we should scan them:
    GC.start(full_mark: true, immediate_sweep: true)

    fiber_full_stack_scan.push(RubyVM.stat.slice(:fiber_full_stack_scan).to_a)
    thread_full_stack_scan.push(RubyVM.stat.slice(:thread_full_stack_scan).to_a)
    stack_scan_bytes.push(RubyVM.stat.slice(:stack_scan_bytes).to_a)

    # The second time we do this, we would imagine that the fiber state has not
    # changed; in theory it should not require any stack scanning:
    GC.start(full_mark: false, immediate_sweep: true)

    fiber_full_stack_scan.push(RubyVM.stat.slice(:fiber_full_stack_scan).to_a)
    thread_full_stack_scan.push(RubyVM.stat.slice(:thread_full_stack_scan).to_a)
    stack_scan_bytes.push(RubyVM.stat.slice(:stack_scan_bytes).to_a)

    fibers.each do |fiber|
      fiber.resume
    end

    fibers = nil
  end
end

A major and a minor GC are called in each iteration of the benchmark. We were hoping to decrease the amount of time the minor GC takes in comparison to the major GC.

Initially, the benchmark was profiled using the built-in GC::Profiler. Discussions with my mentor led to the decision to profile a few additional criteria, as the profiler didn't provide information specific to fibers and the stack, and GC time also varies noticeably from run to run.

Using Ruby's debug counters, I added the following metrics:

static void
cont_mark(void *ptr)
{
    RB_DEBUG_COUNTER_INC(fiber_full_stack_scan);
    rb_context_t *cont = ptr;
    RUBY_MARK_ENTER("cont");
...
}

static void
thread_mark(void *ptr)
{
    RB_DEBUG_COUNTER_INC(thread_full_stack_scan);
    rb_thread_t *th = ptr;
...
}

static void
each_location(rb_objspace_t *objspace, register const VALUE *x, register long n, void (*cb)(rb_objspace_t *, VALUE))
{
    VALUE v;
    while (n--) {
        v = *x;
        cb(objspace, v);
        x++;
        RB_DEBUG_COUNTER_ADD(stack_scan_bytes, 8); /* sizeof(VALUE) on 64-bit */
    }
}

static void
gc_mark_stack_values(rb_objspace_t *objspace, long n, const VALUE *values)
{
    long i;

    for (i=0; i<n; i++) {
        if (is_markable_object(values[i])) {
            gc_mark_and_pin(objspace, values[i]);
            RB_DEBUG_COUNTER_ADD(stack_scan_bytes, 8);
        }
    }
}

Finally, I added a reset to the debug counters that would be triggered at the beginning of every GC mark.

static void
gc_marks_start(rb_objspace_t *objspace, int full_mark)
{
    /* start marking */
    gc_report(1, objspace, "gc_marks_start: (%s)\n", full_mark ? "full" : "minor");
    gc_mode_transition(objspace, gc_mode_marking);
    ruby_debug_counter_reset();
...
}

I had quite a bit of trouble trying to build Ruby with the USE_DEBUG_COUNTERS cppflag enabled on Windows, but thankfully my mentor directed me to a clearer build resource on Ruby's GitHub.

Finding Possible Strategies to the Problem

At this point I was also doing quite a bit of self-studying on operating systems, coroutines, Ruby fibers, Ruby's GC and memory allocation, GC in general, etc. There were many concepts that I would technically be learning in my school courses next year. 

My mentor suggested that I explore the idea of a stack barrier, which would be used to mark part of the stack that doesn't need to be GC'd, specifically on a fiber that was yielded, and also to look into Java's Shenandoah GC and other concurrent GCs for inspiration. This led me to some reading on Go's concurrent GC and some work that was done to reduce stack rescanning, which I thought sounded somewhat related. I also went into the references in the Stack Barrier section of the GC Handbook, where I found several research papers discussing ideas related to concurrent GC, reducing stack scanning, and so on. It sounded like using a stack barrier to indicate part of a stack that didn't need marking could work, so I ended up continuing to work on this idea. Of course, this research did not happen all at once, and I did a lot of reading while I tried to figure out why things didn't work.

Implementing the Stack Barrier 

Initially I started by working off the pseudocode my mentor had sent me, shown below. A stack barrier would be associated with each stack.

// Per stack state:
... void *stack_barrier;

void rb_stack_barrier(VALUE (*yield)(VALUE), VALUE argument) {
  rb_stack *stack = current_stack();
  void *current_stack_barrier = stack->stack_barrier;
  stack->stack_barrier = current_stack_pointer();
  yield(argument);
  stack->stack_barrier = current_stack_barrier; // Restore previous stack barrier on exit
}

When a fiber yields, the stack barrier is placed at the top of the fiber's stack. When a fiber is resumed, its stack barrier is moved away from the top of the stack, to indicate that its stack is changing again and needs scanning. I modified things so that a stack barrier would also be placed on a fiber once the GC had finished scanning it.

GC wouldn't mark the stack if the stack barrier was in place, i.e. cont_mark wouldn't mark it, as cont_mark does the actual scanning of the fiber's stack. Something along the lines of:

static void
fiber_mark(void *ptr)
{
...
    if (!stack_barrier) cont_mark(&fiber->cont);
}

I also wanted to set it up so that the entire stack would still be scanned if a major GC was occurring.

Anyways, this did not work. I added a third GC to the benchmark to see if major GCs could bypass the barrier, and I ended up with segmentation faults and TRY TO MARK T_NONE errors. After some Googling and looking through the Ruby source code, I figured out that this was because the third GC was attempting to mark objects that had already been swept. Preventing cont_mark from being called was therefore preventing some necessary marking.

I did some more reading, and came to the conclusion that it was necessary to somehow save the results of a previous GC cycle in order to avoid rescanning the entire stack, as any object reference found on the stack is part of the root set, which always gets marked.

Kliot, Petrank, and Steensgaard's paper introduced the idea of a FrameRecord, which saves all the pointers in a stack frame in a linked list to be used as a root set, for a concurrent, incremental GC. I decided to create a similar tracking list for inactive fibers.

The goal was to save any object references found in one GC cycle on the stacks of inactive fibers, and only scan those saved references in subsequent GCs, while the fiber was still inactive (i.e. unchanging).

For each fiber, I added a data structure called a fiber_record:

struct rb_fiber_struct {
...
    struct fiber_pool_stack stack;

    struct fiber_record_struct fiber_record;
};

The fiber record consisted of:

struct fiber_record_struct
{
    struct fiber_stack_object *head;
    struct fiber_stack_object *tail;

    // Stack barrier points to some position on the stack
    // that GC cannot go beyond
    void *stack_barrier;

    void *fiber;
};

With the nodes in the linked list being:

struct fiber_stack_object
{
    // a pointer to an object pointer on a fiber stack
    VALUE *stack_obj;
    struct fiber_stack_object *next;
};

The fiber record's values would be initialized in fiber_t_alloc by fiber_record_init:

void
fiber_record_init(struct fiber_record_struct *new_record, void *ptr)
{
    rb_fiber_t *fiber = ptr;
    new_record->head = NULL;
    new_record->tail = NULL;
    new_record->fiber = (void *)fiber;
    new_record->stack_barrier = NULL;
}

A function was added to place and remove the stack barrier:

void
rb_stack_barrier(void (*yield)(rb_fiber_t *, rb_fiber_t *), rb_fiber_t *new_fiber, rb_fiber_t *old_fiber) {
    rb_stack_barrier_set(&(old_fiber->fiber_record), old_fiber->stack.current);
    rb_stack_barrier_set(&(new_fiber->fiber_record), NULL);
    yield(new_fiber, old_fiber); // transfer control
}

void
rb_stack_barrier_set(struct fiber_record_struct *fiber_record, void *destination) {
    fiber_record->stack_barrier = destination;
}

The stack barrier of the fiber that is suspended is set to the start of the fiber's stack, and the stack barrier of the resumed fiber is removed. This function is used when fiber_setcontext occurs.

The stack barrier is also removed in rb_fiber_terminate.

void
rb_fiber_terminate(rb_fiber_t *fiber, int need_interrupt, VALUE error)
{
    VALUE value = fiber->cont.value;

    VM_ASSERT(FIBER_RESUMED_P(fiber));
    rb_fiber_close(fiber);
    rb_stack_barrier_set(&(fiber->fiber_record), NULL);
...
}

Next was creating and marking the fiber record list. Though I tested the functions on the fiber's VM stack, which acts as the stack for Ruby code, I ultimately only used the fiber record tracking on the fiber's machine stack. There are too many object references on the VM stack, which would cause the list itself to take up a great deal of memory.

I added a function to access the fiber record:

struct fiber_record_struct *get_fiber_record(const rb_execution_context_t *ec) {
    rb_fiber_t *fiber = ec->fiber_ptr;
    return &fiber->fiber_record;
}

When the fiber's stack is marked by cont_mark, if there is a machine stack associated with the fiber, rb_fiber_record_mark is called instead of rb_gc_mark_locations. rb_fiber_record_mark was also added to gc.h.

void rb_fiber_record_mark(const rb_execution_context_t *ec, const VALUE *start, const VALUE *end) {
    fiber_record_mark(&rb_objspace, ec, start, end, gc_mark_maybe);
}

The function fiber_record_mark contains the logic for figuring out whether to mark without the fiber record, mark and create a new record, or only mark the record.

void
fiber_record_mark(rb_objspace_t *objspace, const rb_execution_context_t *ec, const VALUE *start, const VALUE *end,
                 void (*cb)(rb_objspace_t *, VALUE))
{
    struct fiber_record_struct *fiber_record = get_fiber_record(ec);

get_fiber_record returns the fiber record associated with the execution context's fiber. The stack is scanned in its entirety if there is no fiber, or if the fiber is active (i.e. its stack barrier is NULL). An active fiber's stack will be changing, so marking a list of previous GC results may not be accurate. The debug counter fiber_full_stack_scan is also incremented in this situation. If a fiber_record was created earlier, it is freed, as it is no longer being used.


    if (!fiber_record || fiber_record->stack_barrier == NULL) {
        each_stack_location(objspace, ec, start, end, cb);
        RB_DEBUG_COUNTER_INC(fiber_full_stack_scan);

        if (fiber_record) fiber_record_free(fiber_record);
    }

In the other scenario, the fiber being marked has a stack barrier in place, meaning it is suspended. If a fiber record has already been created and a major GC is happening, the entire stack is scanned; in a minor GC, only the saved fiber record list is marked.

    else {
        // full gc should mark as normal
        // mark everything, but create a list for inactive fibers
        if (fiber_record->head) {
            // inactive with created record
            if (is_full_marking(objspace)) {
                each_stack_location(objspace, ec, start, end, cb);
                RB_DEBUG_COUNTER_INC(fiber_full_stack_scan);
            }
            else {
                fiber_record_mark_list(objspace, fiber_record, cb);
            }
        }

Otherwise, the entire stack has to be scanned, and a new fiber record created.

        else {
            // inactive without record
            const VALUE *stack_start = start;
            const VALUE *stack_end = end;
            #if defined(__mc68000__)
                stack_start = (const VALUE*)((char*)start + 2);
                stack_end = (const VALUE*)((char*)end - 2);
            #endif

            fiber_record_mark_and_add_locations(objspace, fiber_record, stack_start, stack_end, cb);
            RB_DEBUG_COUNTER_INC(fiber_full_stack_scan);
        }
    }
}

Creating the List

The list is a simple linked list with a head and a tail, with new nodes added at the tail. In the current implementation, tracking both a head and a tail probably isn't necessary; I originally got sidetracked building a tracking list that also worked for active fibers, which is why nodes are appended at the tail, reflecting the direction the stack is scanned in.

A dummy object is created at the head to indicate that a list was made.

void
fiber_record_mark_and_add_locations(rb_objspace_t *objspace, struct fiber_record_struct *fiber_record, const VALUE *start, const VALUE *end,
                        void (*cb)(rb_objspace_t *, VALUE))
{
    register long n;
    register VALUE *x = (VALUE *)start;

    if (end <= start) return;
    n = end - start;

    // create dummy object at head
    struct fiber_stack_object *new_node = malloc(sizeof(struct fiber_stack_object));
    if (new_node == NULL) return; // no memory
    new_node->stack_obj = NULL;
    new_node->next = NULL;
    fiber_record->head = new_node;

Then, the entire stack is iterated through. If the stack slot is a pointer to the heap, then a node is created for it, and entered into the fiber_record.

    VALUE v;
    while (n--) {
        v = *x;
        if (is_pointer_to_heap(objspace, (void *)v)) {
            struct fiber_stack_object *new_node = malloc(sizeof(struct fiber_stack_object));
            if (new_node == NULL) return; // no memory
            new_node->stack_obj = x;
            new_node->next = NULL;

            if (fiber_record->head) {
                // add to the tail
                if (fiber_record->tail) {
                    // none null
                    fiber_record->tail->next = new_node;
                    fiber_record->tail = new_node;
                }
                else {
                    // tail null
                    fiber_record->head->next = new_node;
                    fiber_record->tail = new_node;
                }
            }
            else {
                if (fiber_record->tail) {
                    // head is null
                    fiber_record->head = fiber_record->tail;
                    fiber_record->tail->next = new_node;
                    fiber_record->tail = new_node;
                }
                else {
                    // both null
                    fiber_record->head = new_node;
                    fiber_record->tail = new_node;
                }
            }
            RB_DEBUG_COUNTER_ADD(stack_scan_bytes, 8);
        }

The stack_scan_bytes debug counter is incremented as the stack is scanned. Then gc_mark_maybe (passed in as cb) is called on the stack slot to mark the object.

        cb(objspace, v);
        x++;
    }
}

To mark an existing fiber_record, the following function is used:

void
fiber_record_mark_list(rb_objspace_t *objspace, struct fiber_record_struct *fiber_record, void (*cb)(rb_objspace_t *, VALUE)) {

    if (fiber_record->head == NULL) return;

    struct fiber_stack_object *current = fiber_record->head->next;

    while (current != NULL) {
        cb(objspace, *(current->stack_obj));
        current = current->next;
        RB_DEBUG_COUNTER_ADD(stack_scan_bytes, 8);
    }
}

After checking for NULL pointers, the list is iterated through (skipping the dummy head node), and each object is marked with gc_mark_maybe.

Finally, to free the fiber_record, this function is used to free the linked list:

void
fiber_record_free(struct fiber_record_struct *fiber_record) {

    if (fiber_record->head == NULL) return;

    struct fiber_stack_object *current = fiber_record->head;

    while (current != NULL) {
        struct fiber_stack_object *removed = current;
        current = current->next;
        free(removed);
        RB_DEBUG_COUNTER_INC(stack_obj_remove);
    }

    fiber_record->head = NULL;
    fiber_record->tail = NULL;
}

fiber_record_free is called in fiber_record_mark and cont_free.

Results and Discussion

I ran the benchmark on an unmodified version of Ruby (control) and the version with the stack barrier and fiber record.

To reiterate, the benchmark creates some number of fibers (doubling the count each iteration), and each fiber recursively creates 1000 objects before yielding. Then a major GC, followed by a minor GC, is run. The benchmark repeats this sequence 15 times.

These are the outputs from both the control and the modified versions of Ruby:

[Figure: fiber GC debug counter output]

The data from 10 runs of the benchmark in both the control version and the modified version were collected and graphed.

There seems to be a greater difference in GC time between major and minor GCs in the modified version than in the control version, particularly in iterations with greater numbers of fibers. For instance, there is approximately a 16.4% difference in time between modified GC #29 and #30, while the control has an 8.2% difference for the same GC invocations. However, GC time can vary quite a bit from run to run of the same program.

GC time vs. fiber count was also graphed. The trends generally look the same, though the data points for the highest fiber count cluster further under the 1000 ms grid line on the modified graph than on the control graph, i.e. the minor GC of the modified version generally takes less time than that of the control version.

The spreadsheet with the full data can also be found here.

The modified version reduced the number of full fiber stack scans occurring in the minor GC. For example, in GC #29 and #30, the control version had fiber_full_stack_scan counts of 16385 in both. In comparison, the modified version had counts of 16385 and 1 respectively.

The modified version was also able to reduce stack_scan_bytes in a minor GC. While the control version scanned 430349104 bytes in both GC #29 and GC #30, the modified version scanned 430349072 bytes in GC #29 and 395877136 bytes in GC #30, approximately an 8.3% difference. This indicates that in the control version the entire machine stack was still being scanned for roots, while the modified version was able to scan less of the machine stack in a minor GC cycle.

Evaluation and Next Steps

Although the implementation of a stack barrier and fiber record decreased the number of stack bytes scanned and GC time in minor GCs, I believe that to further decrease the GC cost of inactive fibers, the VM stack should be looked at. More memory is allocated for the VM stack, and it also contains many more object references than the machine stack, which only runs native C code and mainly holds object references when C extensions are in use. There are far more roots present on the VM stack, and reference tracing must be done for each root, which takes time. The stack barrier + fiber record combination would not do much to reduce scanning of the VM stack, which is mostly object references, and would probably use up too much memory, so alternative methods should be investigated.

The design created in this project should also be tested further with C extensions that create object references on the machine stack for full confidence in its functionality. Due to conflicts between Ruby versions on my Windows system, I was unable to test this aspect in detail within the time frame of GSoC.

As well, pointer chasing through linked lists can cause slowdowns from cache misses, which is a flaw in this design. I also think there is some redundant code within the current implementation of the fiber record and stack barrier that could be removed to optimize further.

Conclusion

I learned a lot over the summer working on this project. I met plenty of challenges in picking up all the background knowledge needed to understand the project, along with plenty of difficulties that I would chalk up to not yet knowing my way around a computer. I'm extremely grateful for this learning opportunity, and I would like to thank my mentor Samuel Williams for all his help and guidance.