PostPosted: Fri Mar 23, 2018 8:39 pm 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
The Holy Grail of interpreted languages has come to PLASMA. A JIT (Just In Time) compiler that compiles PLASMA bytecode to optimized 6502 machine code on the fly is working in the 2.0 development branch. This was a slightly non-trivial task given the challenges of working in such a limited environment, but it was definitely an interesting exercise. I thought I would share the implementation for your enjoyment.

First off, I'm working on the PLASMA 2.0 development branch on GitHub. The relevant file is: https://github.com/dschmenk/PLASMA/blob ... le/jit.pla

The requirement for JIT compiling is a 128K Apple //e or //c. I'll get it working on the Apple /// later, with a 65802 version after that.

The JIT compiler is simply a PLASMA module loaded at boot time that sits as a wedge between the VM and running modules. When a routine matches a set of criteria, the VM invokes the JIT compiler, which compiles the bytecode routine to native 6502, patches the function call, and calls the newly minted 6502 routine.

Here are the gory details:

When PLASMA loads a module, it creates a stub function for each bytecode routine that simply calls a page 3 vector. The page 3 vector enables the Language Card (where the VM resides) and calls into the VM. The VM extracts the bytecode address from the call address. Because PLASMA can execute bytecode out of Auxiliary (extended) memory, it needs an address for each routine in Main (global) memory: any routine can be implemented in machine code (which must reside in main/global memory) or bytecode (which can live in any memory), but callable addresses have to be in main/global memory, so the stub provides the main/global address for bytecode routines. With the JIT version of PLASMA, this stub is slightly expanded to contain some runtime statistics and point to a page 3 vector that jumps to the JIT interpreter entrypoint.
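
To make that concrete, here is a hedged sketch of what such a stub might look like; the field layout and vector name are my illustration, not lifted from the loader source:

Code:
    ; Hypothetical stub layout in main/global memory (illustrative only;
    ; see the PLASMA loader source for the real thing)
    stub    JSR JITVEC     ; page 3 vector: enable the LC, enter the JIT VM
            DW  bytecode   ; aux-memory address of the routine's bytecode
            DB  callcnt    ; JIT stub extra: per-routine call countdown
    ; The VM pops the JSR return address off the stack to locate the
    ; fields following the call - that is how it "extracts the bytecode
    ; address from the call address".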

The JIT VM has three tunable parameters it uses for determining when to call the JIT compiler: warmup count, call count, and maximum routine size. At load time, each routine's size is calculated and checked against the maximum value. If too big, it is just interpreted. Otherwise, its stub points to the JIT VM, and its call count is initialized. Every time a module is loaded, the warmup count is reset in anticipation of initialization code being called. Tuning these three values gives great control over when the JIT compiler will be activated.
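
In PLASMA-flavored pseudocode, the load-time check amounts to something like this sketch (the names defsize, maxjitsize, callcntmax and the stub field offsets are mine, purely illustrative):

Code:
    // Hedged sketch of the per-routine decision at module load time
    if defsize > maxjitsize
        stub=>1 = @interpentry // too big: vector straight to the interpreter
    else
        stub=>1 = @jitentry    // vector to the JIT VM entrypoint
        stub->3 = callcntmax   // initialize the per-routine call count
    fin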

When a function is called, it vectors to the VM's JIT interpreter entrypoint, which first checks the warmup counter and, if not zero, decrements it. Once the warmup counter reaches zero, it will then count down a per-routine counter. Until the counters reach zero, the VM simply interprets the bytecode routine as normal. Once the per-routine counter reaches zero, the VM calls the JIT compiler.
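
The real entrypoint is a handful of assembly inside the VM, but the logic boils down to something like this hedged PLASMA sketch (all names illustrative):

Code:
    if warmup > 0
        warmup--                  // module init still settling: interpret
        return interp(defptr)
    fin
    if stub->3 > 1
        stub->3 = stub->3 - 1     // per-routine countdown not expired yet
        return interp(defptr)
    fin
    return jitcompile(stub)       // compile, patch the stub, go native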

The JIT compiler is given a buffer (about 4K currently) that is set aside at boot time for the compilation destination. Once it is full, all other routines passed to the JIT entrypoint are simply patched to call the normal interpreter entrypoint, never to bother the JIT compiler again. The JIT compiler has to do a little housekeeping before jumping into compiling the bytecodes. First, the bytecode routine is copied out of Auxiliary RAM into a temp buffer. Second, the routine is scanned, looking for all the branch destinations. This is required so that later compiler optimizations know where the optimization fences are. The scan also has to mark where data is embedded in the executable code. Once the housekeeping is done, the fun starts.
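
The branch scan is a single pass over the copied bytecode, conceptually like this hedged sketch (the opcode constants and the opsize/setfence helpers are placeholders, not the actual source):

Code:
    word ip, dest
    ip = tempbuf
    while ip < tempbuf + defsize
        if ^ip == BRNCH or ^ip == BRFLS or ^ip == BRTRU
            dest = ip + ip=>1  // relative word operand
            setfence(dest)     // no optimization may span this address
        fin
        ip = ip + opsize(^ip)  // step over opcode plus inline operands
    loop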

There are three types of optimizations done by the compiler to improve the quality of the generated code. First, the Y register is loaded with zero at the beginning of the routine and used often for filling the MSBytes of byte values. Certain operations will temporarily use Y, but quickly reload it with zero. The 65C02 CPU does have the handy STZ opcode for storing zero to memory, but the 6502 doesn't. It turns out that keeping zero in Y has a lot of uses and yields a significant performance enhancement. The bytecode interpreter uses Y as the LSByte of the Emulated Instruction Pointer, but that isn't required for native code. I don't see any reason to make a 65C02-specific version of the compiler.
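
For example, pushing a byte-sized constant onto the evaluation stack can borrow Y for the MSByte. Here is a hedged sketch in the same style as the real emitter (the bytecode number and the constval variable are illustrative, not from the source):

Code:
          is $2A // CB: push a constant byte (sketch only)
               if A_IS_TOSL & TOSL_DIRTY
                   *codeptr = $D095+(VX<<8)    // STA ESTKL,X (write back)
                   codeptr  = codeptr + 2
               fin
               VX--                            // virtual DEX: new TOS slot
               codeptr=>0 = $A9+(constval<<8)  // LDA #constant
               codeptr=>2 = $C094+(VX<<8)      // STY ESTKH,X (MSB = 0 from Y)
               codeptr    = codeptr + 4
               A_IS_TOSL  = TOSL_DIRTY         // constant LSByte cached in A
               break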

Second, the Evaluation Stack Pointer is virtualized. The X register contains the ESP and, in normal operation, is incremented and decremented as values are pushed and popped. This can be virtualized: the compiler tracks the offset and folds it into the addresses used to access the stack values. Only at optimization fences or control flow changes do the virtual stack pointer and the X register need to be synced. This saves a great deal of the X register thrashing that is so common in stack-based VMs.
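
At a fence, syncing is just a matter of emitting INX or DEX until the real X agrees with the virtual value, then clearing the bias so subsequent operand bytes are unbiased. A hedged sketch, not the actual source:

Code:
    def syncvx#0 // illustrative helper: materialize VX at a fence
        while VX > 0
            ^codeptr = $E8 // INX
            codeptr++
            VX--
        loop
        while VX < 0
            ^codeptr = $CA // DEX
            codeptr++
            VX++
        loop
    end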

Third, the Accumulator caches the LSByte of the Top-Of-Evaluation-Stack. Often, the LSByte of the TOS never has to be fetched from or stored to the stack. The compiler is particularly efficient when operating on BYTE-sized memory values.
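
A load-byte opcode shows the cache and the Y-is-zero trick working together: the pointer's LSByte may already be sitting in A (and needs no flush, since the TOS is being consumed), and Y doubles as the zero index for the (zp),Y fetch. A hedged sketch; the opcode number and the TMPL/TMPH scratch pair at $E7/$E8 are assumptions, not the actual source:

Code:
          is $60 // LB: replace TOS (an address) with the byte it points to
               if A_IS_TOSL == 0
                   *codeptr = $D0B5+(VX<<8)  // LDA ESTKL,X (pointer LSB)
                   codeptr  = codeptr + 2
               fin
               codeptr=>0 = $E785            // STA TMPL
               codeptr=>2 = $C0B5+(VX<<8)    // LDA ESTKH,X
               codeptr=>4 = $E885            // STA TMPH
               codeptr=>6 = $E7B1            // LDA (TMPL),Y - Y is zero!
               codeptr=>8 = $C094+(VX<<8)    // STY ESTKH,X (MSB = 0 from Y)
               codeptr    = codeptr + 10
               A_IS_TOSL  = TOSL_DIRTY       // fetched byte cached in A
               break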

Lastly, the instruction set and compiler for PLASMA have been significantly improved for version 2.0. Some of these improvements were made with native code generation in mind, even while helping interpreter performance.

There is still a lot to do. The JIT compiler module is about 14K, which is pretty big. I haven't spent any time working on size; getting it functional was hard enough. Now, though, I have a target of less than 8K, which I think is doable. I believe the Apple II will have a JIT-enabled VM for 128K machines, and the regular VM for all machines. The Apple /// will always have the JIT compiler integrated once 2.0 releases. And then I have to experiment with the 65802/65816 version. That should prove enlightening.

Dave...


PostPosted: Sun Mar 25, 2018 9:55 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10949
Location: England
Bravo! (That JIT compiler module - is that written in PLASMA? How much assembly language is needed in the whole setup?)


PostPosted: Sun Mar 25, 2018 2:37 pm 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
BigEd wrote:
Bravo! (That JIT compiler module - is that written in PLASMA? How much assembly language is needed in the whole setup?)


Ed -
Great question. Surprisingly, only a small amount of assembly is needed: a JIT-specific entry point into the VM that checks the warmup count and per-routine count to determine if the JIT compiler should be called for the routine. Literally a handful of code. The rest is all PLASMA (the file linked at the top). I'm in the process of re-implementing the way native code is written into the buffer to be more space and time efficient. I'll clean it up when it's all debugged. It's very tedious, the equivalent of hand assembling; one mistake takes quite a bit of time to track down. But here is what the code to compile the ISGE - "Is Greater than or Equal" - opcode looks like (cleaned up from the current source):

Code:
          when opcode

        ...

               is $48 // ISGE
                    if A_IS_TOSL & TOSL_DIRTY
                        *codeptr = $D095+(VX<<8) // STA ESTKL,X
                        codeptr  = codeptr + 2
                    fin
                    codeptr=>0  = ($D0B5+$0100)+(VX<<8)  // LDA ESTKL+1,X
                    codeptr=>2  = $D0D5+(VX<<8)          // CMP ESTKL,X
                    codeptr=>4  = ($C0B5+$0100)+(VX<<8)  // LDA ESTKH+1,X
                    codeptr=>6  = $C0F5+(VX<<8)          // SBC ESTKH,X
                    codeptr=>8  = $0250                  // BVC +2
                    codeptr=>10 = $8049                  // EOR #$80
                    codeptr=>12 = $0130                  // BMI +1
                    codeptr=>14 = $9888                  // DEY; TYA                   
                    codeptr=>16 = ($C094+$0100)+(VX<<8)  // STY ESTKH+1,X
                    codeptr=>18 = $00A0                  // LDY #$00
                    VX++                                 // INX
                    codeptr     = codeptr + 20
                    A_IS_TOSL = TOSL_DIRTY
                    break


I'm using word-sized values to make filling the code buffer more efficient, though harder to understand. You may ask why I'm not using a template-based approach and copying a pre-existing code sequence. It has to do with the data footprint in main memory. All data takes up valuable space in the main memory section, whereas bytecode gets to sit in auxiliary memory. So I'm coding each sequence as a series of writes. However, if I can't get the compiler code size down enough, I may have to find a way to move to a template approach and keep the template sequences out of main memory.
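
For contrast, a template version of the ISGE case above would boil down to a copy loop like this hedged sketch, with the table itself eating main memory (names and bytes illustrative):

Code:
    byte isgetmpl = $B5,$D1, $D5,$D0 // ... remaining 16 template bytes
    word src, i
    src = @isgetmpl
    for i = 0 to 19
        ^(codeptr+i) = ^(src+i) // operands still need VX patched per use
    next
    codeptr = codeptr + 20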

In the above sequence, you can see the three optimization approaches in action. The A_IS_TOSL flag tracks the cache state of the LSByte of Top-Of-Stack in the Accumulator. On entry, the op checks if it needs to write back the TOSL. Some ops require this if the Accumulator needs to fetch a different value; others can just use the Accumulator as-is if they were going to read the TOS LSByte anyway. At the end of the op, we leave the TOSL in the Accumulator and set the flag indicating the TOSL is dirty and will (possibly) need to be written back.

Another optimization is to leave zero in the Y register. With this op, however, Y is used to create the logical value for whether NOS is Greater than or Equal to TOS. But we immediately set Y back to zero when done with it. Notice that I combined the DEY and TYA into one word-sized write. PITA to debug this when I get it wrong.

Finally you can see where the Virtual X register is used to adjust the address of the evaluation stack instead of actually updating the X register. Makes for a little uglier code (and of course harder to debug), but much more efficient. Originally, I left all these optimizations out to debug each opcode. Only when they worked as anticipated did I move to the optimized versions, one step at a time.

Dave...


PostPosted: Sun Mar 25, 2018 2:47 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10949
Location: England
Very interesting!


PostPosted: Wed Mar 28, 2018 10:45 pm 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
After killing the last bug with the JIT, time to test out what it can really do.

As a quick review, the JIT compiler will recompile the PLASMA bytecode into ~optimized 6502 machine code. The JIT compiler source is cleaned up quite a bit. Worth checking out: https://github.com/dschmenk/PLASMA/blob ... le/jit.pla

There are three tunable parameters used by the VM to determine when to let the compiler take a swing at a bytecoded routine: initialization warmup, runtime function call count, and function size. Another parameter isn't tunable: the VM has to be built with the size of the destination code buffer fixed in. I wrote a little utility to adjust the tunable parameters from the command line so the effects can be easily measured without having to rebuild anything.

For my testing, I chose the PLASMA compiler, PLASM, and measured the time it took to compile the SIEVE.PLA benchmark. It is a pretty small program, but I measured the time from pressing enter to when the prompt returned. A lot of that time goes to just loading the module and dynamically linking it, which is the same regardless. Nevertheless, it was easier to measure the total time. I first did a baseline measurement with the standard bytecode-interpreted PLASMA VM, then ran through a series of values to find the best performance.

Test: +PLASM SIEVE.PLA

Baseline PLASMA VM: 18.8 seconds (average of 3 runs)

For the JIT compiler, the values iterated through were:
Warmup count: 0, 32, 64, 128, 1024
Then call count: 1, 8, 16, 32, 64, 96
Then max size: 256, 128, 64

Since formatting the tabulated data will probably be a disaster, let me summarize the results. By far, the most important parameter is the call count. The warmup count only had a small effect, and it was detrimental (it just slowed things down). Here are the best times for the combinations of warmup and call counts with size at 256:

Code:
Warmup   Call   Time
------   ----   ----
32       1      24.23
32       8      20.30
32       16     19.43
0        32     17.14
0        64     16.91
0        96     17.78

Then I tried max sizes of 128 and 64 using a warmup count of 0:
Code:
Call   Size  Time
----   ----  ----
1      128   22.8
32     128   17.1
64     128   16.97
96     128   17.76
32     64    19.7
64     64    19.25
96     64    19.25

Then, in search of a minimum, I went through these values for call count with a warmup of 0 and size of 128:
Code:
Call  Time
----  ----
40    17.23
44    16.48
45    16.44
46    16.51
48    16.56
50    16.65
52    16.73
56    16.57

These tests were all with a 6K code buffer. I rebuilt the VM with a 4K code buffer and got the same results.

With these values (warmup=0, call=45, size=128) I compiled a much larger program - the RPN calculator. It gives the compiler a nice workout. Again, the baseline, the 6K codebuffer version, and the 4K codebuffer version:

Test: +PLASM RPNCALC.PLA
Code:
Baseline PLASMA VM: 3:37.48
6K JIT VM: 1:57.92
4K JIT VM: 1:59.02

A 45% speed improvement (3:37.48 is 217.5 seconds, down to 1:59.02, or 119.0 seconds) for a 4K main memory codebuffer and a <10K JIT compiler (in aux memory) as the trade-off. Not bad.

Of course there is much work to do. I'm still working out some of my optimization strategies involving the Y register. Might get a smidge better performance/density, but probably just a few percent best case. I'd also like to get the compiler closer to 8K in size.

Dave...


PostPosted: Thu Mar 29, 2018 12:33 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10949
Location: England
I'd say that's an 80% speedup (217.5 s / 119 s is about 1.8x the rate), which is even better!


PostPosted: Thu Mar 29, 2018 2:33 pm 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
BigEd wrote:
I'd say that's an 80% speedup, which is even better!

Ed - you must be in marketing!

I was pleasantly surprised by the improvement. Especially given that I had already rewritten the lowest level of the lexer (ID tokenizer and keyword matcher) in assembly. The adage that 90% of the time of a program is spent in 10% of the code rings true.

Dave...


PostPosted: Thu Mar 29, 2018 5:29 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10949
Location: England
It's often surprising where the time goes - do you have any sort of profiler? Any way of sampling the call stack?


PostPosted: Fri Mar 30, 2018 2:07 pm 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
BigEd wrote:
It's often surprising where the time goes - do you have any sort of profiler? Any way of sampling the call stack?


At one point I kept a running tab on the number of times each opcode was executed, but it was too much to maintain. For PLASMA 2.0 I instead employed static analysis to hunt for common opcode sequences to combine.

One thing PLASMA is sorely lacking though, is a debugger.

Dave...


PostPosted: Fri Mar 30, 2018 10:45 pm 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
After much code crunching, I have the JIT compiler down to an 8K module. By combining the code generation for similar opcodes, it just meets my original goal - barely. Seriously, it's 8K minus 8 bytes in size. I also reworked the optimization strategy for the Y register. Now its value is tracked, with a predisposition toward returning it to zero, but without forcing it. The Y register gets thrashed often when loading and saving local variables to the frame stack, but the local variable with offset zero is far and away the most common case, and zero is handy for filling the MSB of byte values. So the code generation is a little improved, and about as good as one can expect from an 8K compiler. Perhaps I'll write up the technique with some examples.

If you would like to try it out, I've updated the disk images. On your 128K Apple //e or //c, launch the PLAJIT.SYSTEM (I need a better name, maybe IJIT.SYSTEM) program and you will see the JIT Compiler loaded message. I've since removed one of the tunable parameters since it didn't seem to do any good and made the VM entrypoint ugly. The default values are best for running real programs like the compiler and editor, but for benchmarks they have to be tweaked. Run '+JITUNE 1' to set the call count to 1 before calling the JIT compiler. This basically compiles every routine it can the first time it sees it. Here are a few things to try:

Build and run Rod's Colors
Code:
P BLD
+JITUNE 45 255 - Only if you've already played around with the default values
+PLASM -O2 ROD.PLA
+ROD
+JITUNE 1
+ROD


Notice the difference in speed between the two instances of running ROD.

Check compile speed
Code:
P BLD
+JITUNE 45 0 - Effectively disable JIT compilation for any routine (max size = 0)
+PLASM RPNCALC.PLA - Time the duration (takes awhile)
+JITUNE 45 255 - Reenable default JIT values
+PLASM RPNCALC.PLA - Compare time with above

It's interesting running the JIT compiler on the PLASMA compiler, PLASM. I had previously rewritten the lowest level of the compiler, the part that parses the variable names and keywords, in assembly for a nice performance improvement. Even with that optimization in place, the JIT compiler is able to extract almost 2X the performance from the remaining bytecode.

The other interesting benchmark is SIEVE.PLA. Note that the source code as included prints out all the primes as they're calculated. If you remove the print statements in the editor (CTRL-X on the line containing PUTI and PUTLN), then compile with '+PLASM -O2 SIEVE.PLA', you can see the big difference the JIT compiler can make on pure PLASMA code (don't forget '+JITUNE 1'). About 4X by my measurements.

Download the ProDOS disk images here: https://github.com/dschmenk/PLASMA/tree/devel

Note: this is the version 2 devel branch, so the image names have been updated, i.e. PLASMA-SYS2.PO

Everything else from Version 1 applies here.

Dave...


PostPosted: Sat Mar 31, 2018 8:24 pm 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
A first stab at describing the PLASMA JIT compilation process: https://github.com/dschmenk/PLASMA/wiki ... ementation


PostPosted: Sat Mar 31, 2018 9:32 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10949
Location: England
That's great - documentation is icing on the cake!


PostPosted: Fri Apr 13, 2018 1:45 am 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
The PLASMA JIT compiler for the 6502 has now been joined by a 65802/65816 JIT compiler. PLASMA bytecode modules are compiled on the fly into the proper machine code depending on the CPU. The JIT compilers are now available for the Apple II, Apple III, and IIGS/65802-upgraded machines.

I haven't done any performance measurements on the 65802 yet. The generated code is much more efficient than the 6502 version, as the PLASMA VM is a 16-bit architecture that maps well to the 65802. The only mismatch is when entering, returning from, or calling routines: the evaluation stack in Zero Page has to be mapped to the 65802 hardware stack. As a result, performance will be affected by the number of function calls made.

The JIT compilers are available here for your perusal:

6502 core - https://github.com/dschmenk/PLASMA/blob ... itcore.pla

65802 core - https://github.com/dschmenk/PLASMA/blob ... 16core.pla

This pretty much wraps up the new technology for PLASMA 2.0. Now time to work through all the libraries and sample code to make sure it works with the JIT compilers and architectural updates. And port to more machines. Steve F is working on updating the Beeb port and there should be a C64 version, too.

Dave...


PostPosted: Sat Apr 14, 2018 3:47 pm 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
Here are some quick results for the latest PLASMA VM and JIT compiler optimizations. I moved to testing on an Apple IIGS (for the 65816 CPU) running at 1 MHz. I'm only interested in relative times, and running at 1 MHz should make my wall-clock timing more accurate and somewhat comparable to the previous timings from an Apple //e. My Apple IIGS does have a CFFA 3000, so the I/O overhead is about as minimal as you can get on real hardware.

The test run is the PLASM compiler compiling the RPNCALC.PLA source - a reasonably sophisticated program. This used to be a painful exercise on real, unaccelerated hardware.

The two VMs - the 65802/65816-specific VM and the 6502/65C02-specific VM - are timed, both with the JIT compiler enabled with default values and with the JIT compiler effectively disabled by setting the maximum JIT-compileable routine size to zero.

Code:
Test: +PLASM RPNCALC.PLA

6502  No JIT: 95.45 s (baseline)
6502     JIT: 51.41 s (46% improvement over 6502  No JIT)
                      (37% improvement over 65802 No JIT)
                     
65802 No JIT: 81.52 s (15% improvement over 6502  No JIT)
65802    JIT: 41.50 s (49% improvement over 65802 No JIT)
                      (57% improvement over 6502  No JIT)
                      (19% improvement over 6502     JIT)


What I like about this test is that it's a real-world test. It includes I/O overhead and wasn't tuned to be a benchmark. Whatever the JIT compiler deemed to be important is what got optimized, without any input from me (except for the default JIT parameters, which I previously determined experimentally).

The same JIT parameters are used for both the 6502 and 65802 JIT compilers. I did a quick test changing the values for the 65802 JIT compiler, but it seemed to like the same values used for the 6502 JIT compiler. And this with a completely different machine code architecture - the 65802 version uses the hardware stack for evaluation and generates much smaller code. I'm guessing the code buffer can fit more compiled routines due to its higher code density. The 65802 does have to do more work to map the Zero Page evaluation stack to the hardware stack on entry/exit, though.

I'm still blown away by how small a code buffer can make such an improvement. On my 8 MHz Apple //c with 65802 and 1 MB RAMDISK, the compiles fly. Feels like Turbo Pascal vs Apple Pascal.

I've updated the disk images with the latest VMs. When booting the PLASMA-SYS2.PO image, it will automatically detect the system and launch the best VM for the hardware: https://github.com/dschmenk/PLASMA

Lastly, I've updated the bytecode definition chart to Version 2.0. You can see how the bytecodes get executed in the VMs and how the JIT compilers convert them to machine code: https://github.com/dschmenk/PLASMA/wiki ... Byte-Codes

Dave...


PostPosted: Fri Dec 13, 2019 1:01 am 

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 145
Location: Lake Tahoe
After a year and a half of languishing, I spent some time tracking down the last known JIT bug: a module load/unload interaction with JIT-compiled routines. A serious pain to track down, but it seemed like a good excuse to build a Developer Release: https://github.com/dschmenk/PLASMA/blob ... n%202.0.md

There's even a work-in-progress PLASMA programming series on YouTube that has all the production values of Public Access T.V.: https://www.youtube.com/watch?v=_JEqbcz ... f4SP2Gw3yU

Dave...

