The Holy Grail of interpreted languages has come to PLASMA. A JIT (Just In Time) compiler that compiles PLASMA bytecode to optimized 6502 machine code on the fly is working in the 2.0 development branch. This was a slightly non-trivial task given the challenges of working in such a limited environment, but it was definitely an interesting exercise. I thought I would share the implementation for your enjoyment.
First off, I'm working on the PLASMA 2.0 development branch on GitHub. The relevant file is:
https://github.com/dschmenk/PLASMA/blob ... le/jit.pla

The requirement for JIT compiling is a 128K Apple //e or //c. I'll get it working on the Apple /// later, with a 65802 version after that.
The JIT compiler is simply a PLASMA module loaded at boot time that sits as a wedge between the VM and running modules. The VM will invoke the JIT compiler when it matches a set of criteria for a routine, compiling the bytecode routine to native 6502, patching the function call, and calling the newly minted 6502 routine.
Here are the gory details:
When PLASMA loads a module, it creates a stub function for each bytecode routine that simply calls a page 3 vector. The page 3 vector enables the Language Card (where the VM resides) and calls into the VM. The VM extracts the bytecode address from the call address. Because PLASMA can execute bytecode out of Auxiliary (extended) memory, it needs an address for each routine in Main (global) memory. The reason is that any routine can be implemented in machine code (residing in main/global memory) or bytecode (in any memory), but addresses have to be in main/global memory, so the stub provides the main/global address for bytecode routines. With the JIT version of PLASMA, this stub is slightly expanded to contain some runtime statistics and point to a page 3 vector that jumps to the JIT interpreter entrypoint.
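The stub idea can be modeled roughly like this. This is a hedged sketch of the concept only: the names and the dictionary stand in for what is really a few bytes of 6502 code per routine, and are my invention, not PLASMA's actual layout.

```python
# Sketch of the stub mechanism: bytecode lives in aux memory, but callable
# addresses must be in main memory, so each bytecode routine gets a small
# main-memory stub that carries its aux-memory bytecode address. All names
# here are illustrative, not PLASMA's actual data structures.

stubs = {}   # main-memory stub address -> aux-memory bytecode address

def make_stub(main_addr, aux_bytecode_addr):
    # In the real VM this writes a tiny routine that calls a page 3 vector;
    # here we just record the association.
    stubs[main_addr] = aux_bytecode_addr

def vm_enter(called_through):
    # The VM derives the bytecode address from the address it was called
    # through, i.e. from the stub's own main-memory location.
    return stubs[called_through]

make_stub(0x1000, 0x4000)
assert vm_enter(0x1000) == 0x4000   # calling the stub finds the bytecode
```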
The JIT VM has three tunable parameters it uses to determine when to call the JIT compiler: warmup count, call count, and maximum routine size. At load time, each routine's size is calculated and checked against the maximum value. If too big, the routine is simply interpreted. Otherwise, its stub points to the JIT VM and its call count is initialized. Every time a module is loaded, the warmup count is reset in anticipation of initialization code being called. Tuning these three values gives great control over when the JIT compiler is activated.
When a function is called, it vectors to the VM's JIT interpreter entrypoint, which first checks the warmup counter and decrements it if it isn't zero. Once the warmup counter reaches zero, it then counts down a per-routine counter. Until the counters reach zero, the VM simply interprets the bytecode routine like normal. Once the per-routine counter reaches zero, the VM calls the JIT compiler.
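The trigger logic above can be sketched like this. The parameter values and names are illustrative placeholders; the real thresholds are the VM's tunables.

```python
# Hedged sketch of the JIT trigger: a global warmup counter (reset on
# module load, so init code stays interpreted), then a per-routine call
# counter, then compilation. Values here are invented for illustration.

WARMUP = 10          # calls globally ignored after a module load
CALL_THRESHOLD = 5   # per-routine interpreted calls before compiling
MAX_SIZE = 256       # routines bigger than this are always interpreted

class Routine:
    def __init__(self, size):
        self.size = size
        self.calls_left = CALL_THRESHOLD
        self.native = None            # filled in once JIT-compiled

state = {"warmup": WARMUP}            # reset whenever a module loads

def call(routine):
    if routine.native is not None:
        return "run native"
    if routine.size > MAX_SIZE:
        return "interpret"            # stub never points at the JIT entry
    if state["warmup"] > 0:
        state["warmup"] -= 1          # still warming up: just interpret
        return "interpret"
    if routine.calls_left > 0:
        routine.calls_left -= 1       # hot-but-not-yet: interpret
        return "interpret"
    routine.native = "compiled 6502"  # threshold reached: compile it
    return "run native"
```

With these sample values, a small routine is interpreted for its first 15 calls (10 warmup plus 5 per-routine) and runs as native code from the 16th call on.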
The JIT compiler is given a buffer (about 4K currently) that is set aside at boot time for the compilation destination. Once it is full, all other routines passed to the JIT entrypoint are simply patched to call the normal interpreter entrypoint, never to bother the JIT compiler again. The JIT compiler has to do a little housekeeping before jumping into compiling the bytecodes. First, the bytecode routine is copied out of Auxiliary RAM into a temp buffer. Second, the routine is scanned, looking for all the branch destinations. This is required so that later compiler optimizations know where the optimization fences are. It also has to mark where data is while scanning executable code. Once the housekeeping is done, the fun starts.
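The branch-destination pre-pass might look something like this. The opcode encoding below is invented purely for the sketch; PLASMA's actual bytecode encoding differs.

```python
# Illustrative pre-pass: walk a routine once, recording every branch
# destination so later optimizations know where the "fences" are (points
# where cached state must be valid because control flow can land there).
# Opcode values and sizes are made up for this sketch.

BRANCH_OPS = {0x50, 0x52, 0x54}   # hypothetical branches with 16-bit offset

def find_fences(bytecode):
    fences = set()
    pc = 0
    while pc < len(bytecode):
        op = bytecode[pc]
        if op in BRANCH_OPS:
            rel = int.from_bytes(bytecode[pc + 1:pc + 3], "little", signed=True)
            fences.add(pc + 3 + rel)  # branch target is an optimization fence
            pc += 3
        else:
            pc += 1                   # pretend all other ops are one byte
    return fences
```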
There are three types of optimizations done by the compiler to improve the quality of generated code. First, the Y register is loaded with zero at the beginning of the routine and used often for filling MSBytes of byte values. Certain operations will temporarily use Y, but quickly reload it with zero. The 65C02 CPU does have some handy STZ opcodes for storing zero to memory, but the 6502 doesn't. It turns out that using Y for zero has a lot of uses and provides a significant performance enhancement. The bytecode interpreter uses Y as the LSByte of the Emulated Instruction Pointer, but that isn't required for native code. I don't see any reason to make a 65C02-specific version of the compiler.
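As a rough illustration of the savings, here is a mock emitter contrasting the two sequences for storing a zero MSByte to the evaluation stack. The mnemonics are real 6502, and ESTKH follows the PLASMA VM's naming for the eval-stack high-byte array, but the emitter itself is my own sketch, not the compiler's actual code.

```python
# Mock code generator showing the benefit of keeping zero in Y on the
# NMOS 6502 (which lacks the 65C02's STZ): storing a zero MSByte to the
# eval stack takes one instruction instead of two. The framework is
# invented; only the instruction sequences illustrate the point.

def store_zero_msb(y_holds_zero):
    if y_holds_zero:
        return ["STY ESTKH,X"]            # Y is already zero: one store
    return ["LDA #$00", "STA ESTKH,X"]    # otherwise reload zero each time

assert len(store_zero_msb(True)) < len(store_zero_msb(False))
```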
Second, the Evaluation Stack Pointer is virtualized. The X register contains the ESP, and in normal operation is incremented and decremented as values are pushed and popped. This operation can be virtualized, and the address of the stack can be manipulated to access the stack values. Only at optimization fences or control flow changes do the virtual stack pointer and the X register need to be synced. This saves a great deal of the X register thrashing that is so common in stack-based VMs.
Third, the Accumulator caches the LSByte of the Top-Of-Evaluation-Stack. Often, the LSByte of the TOS never has to be fetched from or stored to the stack. Especially when operating on BYTE-sized memory values, the compiler is particularly efficient.
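The virtualized stack pointer idea can be sketched as follows. The class and method names are invented; only the concept, tracking pushes and pops at compile time and emitting INX/DEX only at fences, comes from the description above.

```python
# Sketch of ESP virtualization: the compiler tracks net stack motion as a
# compile-time offset and only emits real INX/DEX instructions when an
# optimization fence forces the hardware X register to agree with it.
# (PLASMA's eval stack grows downward, so a push is DEX.) The framework
# is illustrative, not the JIT compiler's actual code.

class CodeGen:
    def __init__(self):
        self.vsp = 0        # compile-time offset from the real X value
        self.code = []      # emitted instructions

    def push(self):
        self.vsp -= 1       # no DEX emitted; just note the motion

    def pop(self):
        self.vsp += 1       # likewise, no INX emitted

    def sync(self):
        # At a fence or control-flow change, emit just enough INX/DEX
        # to make the hardware X match the virtual stack pointer.
        op = "DEX" if self.vsp < 0 else "INX"
        self.code += [op] * abs(self.vsp)
        self.vsp = 0

g = CodeGen()
g.push(); g.push(); g.pop()   # net effect: one value pushed
g.sync()
assert g.code == ["DEX"]      # one DEX instead of DEX, DEX, INX
```

The same bookkeeping also lets stack slots be addressed at fixed offsets from the virtual pointer between syncs, which is where the real savings in X-register traffic comes from.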
Lastly, the instruction set and compiler for PLASMA have significantly improved for version 2.0. Some of these improvements were made with native code generation in mind, even while helping interpreter performance.
There is still a lot to do. The JIT compiler module is about 14K, which is pretty big. I haven't spent any time working on size; getting it functional was hard enough. Now, though, I have a target of less than 8K, which I think is doable. I believe the Apple II will have a JIT-enabled VM for 128K machines, and the regular VM for all machines. The Apple /// will always have the JIT compiler integrated once 2.0 releases. And then I have to experiment with the 65802/65816 version. That should prove enlightening.
Dave...