Technical Overview


The HotPy(2) architecture consists of 5 major components.
  1. Supervisor manges the execution of the other components.
  2. Base interpreter. The standard stack-based interpreter from CPython.
  3. Trace-recording interpreter}. Records the flow of execution, until a loop of some other natural break point is found.  The resulting trace is passed to the optimiser chain.
  4. Optimiser chain Converts the recorded (inefficient) traces into optimised traces.
  5. Trace Manager Stores optimises traces and provides them on request. Also discards infrequently used traces, in order to prevent excessive meory usage.

Execution within the HotPy(2) VM

Execution in HotPy(2) is managed by a Supervisor which delegates actual execution to one of the interpreters (or compiled code if present) to do the actual execution of the program. The Supervisor's role is to coordinate the execution of the other components. Initially the Supervisor calls the Base Interpreter to execute the program. When the Base Interpreter reaches a hot-spot in the program, it returns execution to the Supervisor.

The first time a point in a program becomes hot, the Supervisor passes execution to the Trace-recording Interpreter which records a trace and passes the resulting trace to the optimisers. On second and subsequent executions of the hot-spot, the Supervisor delegates execution to the Trace Manager.

The Trace Manager looks up the previously recorded trace matching the hot-spot and passes it to the Fast Interpreter for execution. When it cannot find a trace for a requested point in the program it returns execution to the supervisor.

Trace Execution

The HotPy(2) execution engine monitors running code for ``hot'' sections, which are run frequently. Execution of these sections is recorded to
form traces, which are linear sequences of bytecodes. The traces are optimised and stored in a trace-cache. The next time a section of code that has been recorded is to be executed, the optimised trace is executed instead. When running, typically over 90% of the execution time is spent executing optimised traces and less than 10% running unoptimised code.

"Fat" and "Thin" Bytecodes

In general, a 'fat' bytecode is one that performs a slow, complex operation, such that the time spent executing the instruction is much larger than the time spent dispatching the instruction and handling the operands. A thin bytecode has the relation between time spent doing useful work and interpreter overhead reversed; the time spent executing the instruction is often less than the time spent dispatching the instruction.

For the purpose of HotPy(2) a 'fat' bytecode is one that can decomposed, in other words in can be implemented as a series of 'thin' bytecodes.

The Trace Recorder

High -level (scripting) languages, like Python, Ruby or Javascript, are implemented with a high-level interpreter which includes many fat bytecodes.

The trace recorder records any thin bytecodes that it encounters. However, when it encounters a fat bytecodes, rather than execute and record the fat bytecode it performs a call into a function written entirely in thin bytecodes that has the same effect.

The resulting trace is thus composed entirely of thin bytecodes suitable for optimisation.
Typically, this tracing of thin bytecodes (without further optimisation) increases the number of instructions executed by a factor of 3 to 6, and increases run-time by 50% to 100%. 

The Optimiser Chain

Type Specialisation

Type specialisation is a known optimisation technique, which takes advantage of the fact that the variables in the execution of a trace (usually in a loop) have the same types from one iteration to the next. Type specialisation allows many operations will look up values in an object's class can be replaced with constants.

Typically, Type Specialisation reduces the number of instructions by about 30% and the run-time by 20%, relative to the unoptimised trace. The specialised trace will still be slower than the original (fat) bytecode.        

Deferred Object Creation

Deferred Object Creation (D.O.C.) is a form of speculative scalar replacement of objects. Python creates a very large number of small short lived objects, D.O.C. defers creating those objects for as long as possible, and by doing so is able to avoid creating many objects entirely, or to create them more efficiently.

Typically, Deferred Object Creation reduces the number of instructions by a factor of 4 to 8 and the runtime by a similar factor,
with a fast stack-based interpreter. For one benchmark (fannkuch) D.O.C. resulted in a 10 fold reduction in instruction count and runtime (although only a 5 fold speedup from the base-line performance).

D.O.C. is so effective because Instruction Thinning and Type Specialisation have created long traces which expose the many redundancies in the Python execution model.

Register Allocator

The register allocator converts the traces from stack-based form to register-based form.

Register-Based Interpreter

The final interpreter in HotPy(2), the one that will execute the optimised traces, is a register-based interpreter.
It is expected that the register-based code will execute about 40% faster than the stack-based code.
This estimate is based on published literature, rather than experimental data from the original HotPy.