Making bytecode interpreter faster (without eating more ROM)

Programming the 6502 microprocessor and its relatives in assembly and other languages.
Bregalad
Posts: 149
Joined: 27 Mar 2010
Location: Chexbres, VD, Switzerland

Making bytecode interpreter faster (without eating more ROM)

Post by Bregalad »

I wonder what techniques could be used to achieve such a goal. Something that comes naturally would be to replace the execute loop with something that copies instructions to RAM and then jumps to that code:

The old way is:
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
  • The instruction's code is executed from ROM
  • Increment the bytecode pointer
  • Jump back into the interpreter
The faster way would be:
  • Fetch an instruction from ROM
  • Read the "jump table", but copy the target's content to a RAM buffer instead of executing it
  • Increment the bytecode pointer
  • Repeat for all instructions in a basic block
  • JSR to the RAM buffer, which executes multiple instructions at once
We spend less time decoding instructions and jumping, but more time copying to RAM. To make this profitable, a large number of instructions must be decoded at once.

The problem is jump, call and branch instructions (hence the notion of a basic block). The immediate solution is to stop the loop I describe above at the first such instruction and interpret it "normally".
Unfortunately, a basic block will be only 3-5 instructions long in most cases, which renders this whole idea useless.

So the other idea would be to include at least branches in the "basic block" part, so that only jumps and calls would need to exit the RAM buffer and return to the interpreter. The only idea I'd have is to make each instruction's native size directly proportional to its bytecode size (for example, 1-byte instructions become 8 bytes of interpreter code, 2-byte instructions 16 bytes, and so on), so that the branch offsets can be shifted left a couple of times to become the "real" branch offsets.
This sounds like a cool idea, but it seriously limits what the interpreter can do: sometimes there will not be enough space for instructions you'd like to implement, and sometimes you'll have to waste ROM space with NOPs (also making interpreted code slower) in order to fulfil that condition.

So the only practical solution would be to know at compile time how many bytes each instruction takes, and account for that in the branches, so that the branch offsets refer to 6502 instructions and not to the bytecode itself. Needless to say, the portability and maintainability of the bytecode machine seriously suffer with this approach.
BigDumbDinosaur
Posts: 9425
Joined: 28 May 2009
Location: Midwestern USA (JB Pritzker’s dystopia)

Re: Making bytecode interpreter faster (without eating more ROM)

Post by BigDumbDinosaur »

Bregalad wrote:
I wonder what techniques could be used to achieve such a goal. Something that comes naturally would be to replace the execute loop with something that copies instructions to RAM and then jumps to that code:

The old way is:
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
  • The instruction's code is executed from ROM
  • Increment the bytecode pointer
  • Jump back into the interpreter
The faster way would be:
  • Fetch an instruction from ROM
  • Read the "jump table", but copy the target's content to a RAM buffer instead of executing it
  • Increment the bytecode pointer
  • Repeat for all instructions in a basic block
  • JSR to the RAM buffer, which executes multiple instructions at once
I really don't see where you would be gaining a great deal with the second procedure. You still have the overhead of "translating" the tokenized program text into the appropriate machine instructions. The logical place to do that is at compile time, not run time.

Something that might be worth studying is how it is done in timesharing BASICs, such as BBx and Thoroughbred. These implementations are noted for rapid execution, despite being (nominally) interpreted languages. The secret is that almost all of the interpretation occurs as statements are entered and "compiled" (tokenized). The run-time engine uses sorted look-up tables for variable names and branch/jump/subroutine targets, tables that are generated as the program is being written and tokenized. Hence at run-time, a statement such as GOSUB PRINT_STATEMENT does not incur the overhead of a sequential search of the program text for the subroutine PRINT_STATEMENT.

An adjunct to these optimizing techniques is the use of labels as branch/jump/subroutine targets instead of line numbers. Line numbers are permitted, of course, but use of labels substantially improves execution speed, since the symbol table that is attached to the program points directly to the address of the labeled statement. Hence much less complex evaluation is required as statements are executed.

Another optimization technique is the evaluation of mathematical statements during program entry. The console-mode engine actually breaks a mathematical statement down into its individual terms and internally rearranges the expression into the order in which it must be solved. Hence this "labor-intensive" step is taken out of the program at run-time.

So in a sense, BBx or Thoroughbred BASIC programs are mostly interpreted as they are generated, rather than as they run, greatly improving performance.
x86?  We ain't got no x86.  We don't NEED no stinking x86!
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England

Re: Making bytecode interpreter faster (without eating more ROM)

Post by BigEd »

Bearing in mind that most execution time probably happens in loops, and loops are defined by backwards branches, I think there's some hope here. As you say, the native code you run will need to branch back with a 6502 branch and so you'll need to compute the offset in 6502 bytes: but you just processed the basic block as a unit, and can keep track during that of how long each instruction was. Slightly easier, take two passes: on the second pass, you know where the branch targets are. You can also cache and reuse native code for basic blocks you've visited before.

[Edit: it's normal to start by interpreting everything but collect profiling information, and only apply the JIT transformation to hot spots. The profiling can also collect the data about the starts and ends of basic blocks, and destinations of branches.]
Last edited by BigEd on Wed May 13, 2015 8:15 am, edited 1 time in total.
White Flame
Posts: 704
Joined: 24 Jul 2012

Re: Making bytecode interpreter faster (without eating more ROM)

Post by White Flame »

A simple JIT scheme like that is equivalent to performing code compression/decompression, with the VM bytecode as the compression dictionary. Maybe other compression schemes would be more useful?

If you use a generalized byte-level compression scheme, the ROM routines for performing the decompression would likely also be smaller, compared to having per-instruction interpreter bodies in ROM. I'm not ultra familiar with how well hand-rolled 6502 compresses in traditional schemes, though.

One hybrid scheme would be to have regular 6502 intermixed with escape codes which will substitute in JSRs or larger blocks of code. That way your loops can still be tight comparisons on flags & register values in plain 6502, and boilerplate code can be reduced in application-specific ways.
Bregalad
Posts: 149
Joined: 27 Mar 2010
Location: Chexbres, VD, Switzerland

Re: Making bytecode interpreter faster (without eating more ROM)

Post by Bregalad »

In my previous thread about program compression, the common consensus was that using a VM is the best way to compress a program. I can smell an infinite loop :)
Quote:
I really don't see where you would be gaining a great deal with the second procedure. You still have the overhead of "translating" the tokenized program text into the appropriate machine instructions. The logical place to do that is at compile time, not run time.
The great advantage is that if there is a loop, the bytecode is passed through only once and not N times, hence a potential gain in speed. If it weren't for loops, there would be no possible gain in such a scheme.
Quote:
Slightly easier, take two passes: on the second pass, you know where the branch targets are
Interesting idea; however, I don't quite get it. What work exactly do you do on the first and second passes?
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England

Re: Making bytecode interpreter faster (without eating more ROM)

Post by BigEd »

Bregalad wrote:
Quote:
Slightly easier, take two passes: on the second pass, you know where the branch targets are
Interesting idea; however, I don't quite get it. What work exactly do you do on the first and second passes?
Great question. I didn't actually think very hard about that. I was thinking that it would help to run through a first pass without generating code, just accounting for the lengths of code and making a list of destinations. Then the second pass knows the native addresses of the destinations at the time it's building the code for the branches and jumps.

There are different approaches to JITting, one of which is 'tracing JIT' - that might be close to what I was thinking. But I haven't studied these, far less implemented them - just read about them.

Sorry not to be very helpful
Ed
White Flame
Posts: 704
Joined: 24 Jul 2012

Re: Making bytecode interpreter faster (without eating more ROM)

Post by White Flame »

To expand on my idea above: since the primary goal is speed and the secondary is size, I think having inline non-bytecoded 6502 code would be important. You could elide bytecode implementations for many small or peephole-optimizable instructions, hopefully reducing your ROM footprint. If the expanded bytecode ABI left the N/V/Z/C flags visible to 6502 branch instructions, that would be quite fast at runtime, and would eliminate implementations for an entire class of instructions.

Another idea for the JITter would be to merge your bytecode with the 6502 opcodes. This way, the inline 6502 doesn't need any escape headers; the bytecode would just take over some of the 6502 instruction space. This might make the code decompressor a little bigger (it has to know or calculate opcode/length pairs for the 6502 instructions), but would make all your scripts smaller, assuming inlining.

The inline 6502 code would ideally be assembled such that the branch targets match the expanded JIT code, to avoid run-time fixup. That should be possible using simple assembler macros.
oullr
Posts: 4
Joined: 06 May 2015

Re: Making bytecode interpreter faster (without eating more ROM)

Post by oullr »

A relatively simple technique to accelerate VM execution by, say, 20-50% is direct threading (see https://en.wikipedia.org/wiki/Threaded_ ... _threading). It fattens the code a bit, though.
bogax
Posts: 250
Joined: 18 Nov 2003

Re: Making bytecode interpreter faster (without eating more ROM)

Post by bogax »

Implement [something like] local labels in your byte code interpreter.
jgharston
Posts: 181
Joined: 22 Feb 2004

Re: Making bytecode interpreter faster (without eating more ROM)

Post by jgharston »

Bregalad wrote:
I wonder what techniques could be used to achieve such a goal. Something that comes naturally would be to replace the execute loop with something that copies instructions to RAM and then jumps to that code:
The old way is:
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
  • The instruction's code is executed from ROM
  • Increment the bytecode pointer
  • Jump back into the interpreter
You can speed that up by making the tail of each instruction's code the fetch-and-decode code; then you get:
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
  • The instruction's code is executed from ROM
  • Increment the bytecode pointer
  • Fetch an instruction from ROM
  • Read the jump table and jump to the instruction's address
etc.

It depends on the implementation whether this takes more code or not. Any optimisation is a balancing act between speed and size.
Bregalad
Posts: 149
Joined: 27 Mar 2010
Location: Chexbres, VD, Switzerland

Re: Making bytecode interpreter faster (without eating more ROM)

Post by Bregalad »

Well, there is no need to answer me anymore; I've decided it was too complicated anyway and I will not go in this direction.

But yes, doing classic optimisations (by "classic" I mean without changing everything) to the interpreter is probably the best way to make it faster, especially since optimising it for the most-used instructions should be enough.
Druzyek
Posts: 367
Joined: 12 May 2014

Re: Making bytecode interpreter faster (without eating more ROM)

Post by Druzyek »

Quote:
So in a sense, BBx or Thoroughbred BASIC programs are mostly interpreted as they are generated, rather than as they run, greatly improving performance.
Just out of curiosity, if you have enough RAM and storage space on a modern machine, why even interpret at all? Wouldn't it make more sense to just generate machine code if you're generating anything?
barrym95838
Posts: 2056
Joined: 30 Jun 2013
Location: Sacramento, CA, USA

Re: Making bytecode interpreter faster (without eating more ROM)

Post by barrym95838 »

Druzyek wrote:
Just out of curiosity, if you have enough RAM and storage space on a modern machine, why even interpret at all? Wouldn't it make more sense to just generate machine code if you're generating anything?
Well-written machine code is impossible to beat for speed, but it can easily be beaten in density by an interpreter on many architectures, including the 65xx. It's finding the right balance for the particular application that can be a bit irksome.

Mike B.