Making bytecode interpreter faster (without eating more ROM)

Programming the 6502 microprocessor and its relatives in assembly and other languages.
Bregalad
Posts: 149
Joined: 27 Mar 2010
Location: Chexbres, VD, Switzerland

Making bytecode interpreter faster (without eating more ROM)

Post by Bregalad »

I wonder what techniques could be used to achieve such a goal. Something that comes naturally would be to replace the execute loop with something that copies instructions to RAM and then jumps to that code:

The old way is:
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
  • The instruction's code is executed from ROM
  • Increment the bytecode pointer
  • Jump back into the interpreter
The faster way would be:
  • Fetch an instruction from ROM
  • Read the "jump table", but copy the target's content to a RAM buffer instead of executing it
  • Increment the bytecode pointer
  • Repeat for all instructions in a basic block
  • JSR to the RAM buffer, which executes multiple instructions at once
We spend less time decoding instructions and jumping, but more time copying to RAM. To make this profitable, a large number of instructions must be decoded at once.

The problem is jump, call and branch instructions (hence the notion of a basic block). The immediate solution is to stop the loop I describe above at the first such instruction and interpret it "normally".
Unfortunately, a basic block will be only 3-5 instructions long in most cases, which renders this whole idea useless.

So the other idea would be to include at least branches in the "basic block" part, so that only jumps and calls would need to exit the RAM buffer and return to the interpreter. The only idea I'd have is to make each instruction's native size directly proportional to its bytecode size (for example, 1-byte instructions become 8 bytes of interpreter code, 2-byte instructions 16 bytes, and so on), so that the branch offsets can be shifted left a couple of times to become the "real" branch offsets.
This sounds like a cool idea, but it seriously limits what the interpreter can do: sometimes there will not be enough space for instructions you'd like to implement, and sometimes you'll have to waste ROM space with NOPs (also making interpreted code slower) in order to fulfil that condition.

So the only practical solution would be to know at compile time how many bytes each instruction takes, and account for that in the branches, so that the branch offsets refer to 6502 instructions and not to the bytecode itself. Needless to say, the portability and maintainability of the bytecode machine seriously suffer with this approach.
BigDumbDinosaur
Posts: 9425
Joined: 28 May 2009
Location: Midwestern USA (JB Pritzker’s dystopia)

Re: Making bytecode interpreter faster (without eating more ROM)

Post by BigDumbDinosaur »

Bregalad wrote:
I wonder what techniques could be used to achieve such a goal. Something that comes naturally would be to replace the execute loop with something that copies instructions to RAM and then jumps to that code:

The old way is:
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
  • The instruction's code is executed from ROM
  • Increment the bytecode pointer
  • Jump back into the interpreter
The faster way would be:
  • Fetch an instruction from ROM
  • Read the "jump table", but copy the target's content to a RAM buffer instead of executing it
  • Increment the bytecode pointer
  • Repeat for all instructions in a basic block
  • JSR to the RAM buffer, which executes multiple instructions at once
I really don't see where you would be gaining a great deal with the second procedure. You still have the overhead of "translating" the tokenized program text into the appropriate machine instructions. The logical place to do that is at compile time, not run time.

Something that might be worth studying is how it is done in timesharing BASICs, such as BBx and Thoroughbred. These implementations are noted for rapid execution, despite being (nominally) interpreted languages. The secret is that almost all of the interpretation occurs as statements are entered and "compiled" (tokenized). The run-time engine uses sorted look-up tables for variable names and branch/jump/subroutine targets, tables that are generated as the program is being written and tokenized. Hence at run-time, a statement such as GOSUB PRINT_STATEMENT does not incur the overhead of a sequential search of the program text for the subroutine PRINT_STATEMENT.

An adjunct to these optimizing techniques is the use of labels as branch/jump/subroutine targets instead of line numbers. Line numbers are permitted, of course, but use of labels substantially improves execution speed, since the symbol table that is attached to the program points directly to the address of the labeled statement. Hence much less complex evaluation is required as statements are executed.

Another optimization technique is the evaluation of mathematical statements during program entry. The console-mode engine actually breaks a mathematical statement down into its individual terms and internally rearranges the expression into the order in which it must be solved. Hence this "labor-intensive" step is taken out of the program at run-time.

So in a sense, BBx or Thoroughbred BASIC programs are mostly interpreted as they are generated, rather than as they run, greatly improving performance.
x86?  We ain't got no x86.  We don't NEED no stinking x86!
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England

Re: Making bytecode interpreter faster (without eating more ROM)

Post by BigEd »

Bearing in mind that most execution time probably happens in loops, and loops are defined by backwards branches, I think there's some hope here. As you say, the native code you run will need to branch back with a 6502 branch and so you'll need to compute the offset in 6502 bytes: but you just processed the basic block as a unit, and can keep track during that of how long each instruction was. Slightly easier, take two passes: on the second pass, you know where the branch targets are. You can also cache and reuse native code for basic blocks you've visited before.

[Edit: it's normal to start by interpreting everything but collect profiling information, and only apply the JIT transformation to hot spots. The profiling can also collect the data about the starts and ends of basic blocks, and destinations of branches.]
Last edited by BigEd on Wed May 13, 2015 8:15 am, edited 1 time in total.
White Flame
Posts: 704
Joined: 24 Jul 2012

Re: Making bytecode interpreter faster (without eating more ROM)

Post by White Flame »

A simple JIT scheme like that is equivalent to performing code compression/decompression, with the VM bytecode as the compression dictionary. Maybe other compression schemes would be more useful?

If you use a generalized byte-level compression scheme, the ROM routines for performing the decompression would likely also be smaller, compared to having per-instruction interpreter bodies in ROM. I'm not ultra familiar with how well hand-rolled 6502 compresses in traditional schemes, though.

One hybrid scheme would be to have regular 6502 intermixed with escape codes which will substitute in JSRs or larger blocks of code. That way your loops can still be tight comparisons on flags & register values in plain 6502, and boilerplate code can be reduced in application-specific ways.
Bregalad
Posts: 149
Joined: 27 Mar 2010
Location: Chexbres, VD, Switzerland

Re: Making bytecode interpreter faster (without eating more ROM)

Post by Bregalad »

In my previous thread about program compression, the common consensus was that using a VM is the best way to compress a program. I can smell an infinite loop :)
Quote:
I really don't see where you would be gaining a great deal with the second procedure. You still have the overhead of "translating" the tokenized program text into the appropriate machine instructions. The logical place to do that is at compile time, not run time.
The great advantage is that if there is a loop, the bytecode is passed through only once and not N times, hence a potential gain in speed. If it weren't for loops, there would be no possible gain in such a scheme.
Quote:
Slightly easier, take two passes: on the second pass, you know where the branch targets are
Interesting idea; however, I don't quite get it. What work exactly do you do on the first and second passes?
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England

Re: Making bytecode interpreter faster (without eating more ROM)

Post by BigEd »

Bregalad wrote:
Quote:
Slightly easier, take two passes: on the second pass, you know where the branch targets are
Interesting idea; however, I don't quite get it. What work exactly do you do on the first and second passes?
Great question. I didn't actually think very hard about that. I was thinking that it would help to run through a first pass without generating code, just accounting for the lengths of code and making a list of destinations. Then the second pass knows the native addresses of the destinations at the time it's building the code for the branches and jumps.

There are different approaches to JITting, one of which is 'tracing JIT' - that might be close to what I was thinking. But I haven't studied these, far less implemented them - just read about them.

Sorry not to be very helpful
Ed
White Flame
Posts: 704
Joined: 24 Jul 2012

Re: Making bytecode interpreter faster (without eating more ROM)

Post by White Flame »

To expand on my idea above: since the primary goal is speed and the secondary is size, I think having inline non-bytecoded 6502 code would be important. You could elide bytecode implementations for many small or peephole-optimizable instructions, hopefully reducing your ROM footprint. If the expanded bytecode ABI left the N/V/Z/C flags visible to 6502 branch instructions, that would be quite fast at runtime, and would eliminate implementations for an entire class of instructions.

Another idea for the JITter would be to merge your bytecode with the 6502 opcodes. This way, the inline 6502 doesn't need any escape headers; the bytecode would just take over some of the 6502 instruction space. This might make the code decompressor a little bigger (it has to know or calculate opcode/length pairs for the 6502 instructions), but would make all your scripts smaller, assuming inlining.

The inline 6502 code would ideally be assembled such that the branch targets match the expanded JIT code, to avoid run-time fixup. That should be possible using simple assembler macros.
oullr
Posts: 4
Joined: 06 May 2015

Re: Making bytecode interpreter faster (without eating more ROM)

Post by oullr »

A relatively simple technique to accelerate VM execution by, say, 20-50% is direct threading (see https://en.wikipedia.org/wiki/Threaded_ ... _threading). It fattens the code a bit, though.
bogax
Posts: 250
Joined: 18 Nov 2003

Re: Making bytecode interpreter faster (without eating more ROM)

Post by bogax »

Implement [something like] local labels in your byte code interpreter.
jgharston
Posts: 181
Joined: 22 Feb 2004

Re: Making bytecode interpreter faster (without eating more ROM)

Post by jgharston »

Bregalad wrote:
I wonder what techniques could be used to achieve such a goal. Something that comes naturally would be to replace the execute loop with something that copies instructions to RAM and then jumps to that code:
The old way is:
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
  • The instruction's code is executed from ROM
  • Increment the bytecode pointer
  • Jump back into the interpreter
You can speed that up by making the tail of each instruction's code the fetch-and-decode code; then you get:
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
  • The instruction's code is executed from ROM
  • Increment the bytecode pointer
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
etc.

It depends on the implementation whether this takes more code or not. Any optimisation is a balancing act between speed and size.
Bregalad
Posts: 149
Joined: 27 Mar 2010
Location: Chexbres, VD, Switzerland

Re: Making bytecode interpreter faster (without eating more ROM)

Post by Bregalad »

Well, there is no need to answer me anymore; I've decided it was too complicated anyway and I will not go in this direction.

But yes, doing classic optimisations (by "classic" I mean without changing everything) to the interpreter is probably the best way to make it faster, especially since optimising it for the most-used instructions should be enough.
Druzyek
Posts: 367
Joined: 12 May 2014

Re: Making bytecode interpreter faster (without eating more ROM)

Post by Druzyek »

Quote:
So in a sense, BBx or Thoroughbred BASIC programs are mostly interpreted as they are generated, rather than as they run, greatly improving performance.
Just out of curiosity, if you have enough RAM and storage space on a modern machine, why even interpret at all? Wouldn't it make more sense to just generate machine code if you're generating anything?
barrym95838
Posts: 2056
Joined: 30 Jun 2013
Location: Sacramento, CA, USA

Re: Making bytecode interpreter faster (without eating more ROM)

Post by barrym95838 »

Druzyek wrote:
Just out of curiosity, if you have enough RAM and storage space on a modern machine, why even interpret at all? Wouldn't it make more sense to just generate machine code if you're generating anything?
Well-written machine code is impossible to beat for speed, but it can easily be beaten in density by an interpreter on many architectures, including the 65xx. It's finding the right balance for the particular application that can be a bit irksome.

Mike B.