The old way is:
- Fetch an instruction from ROM
- Read jump table and jump to instruction's address
- The instruction's code is executed from ROM
- Increment the bytecode pointer
- Jump back into interpreter
The new way would be:
- Fetch an instruction from ROM
- Read the "jump table", but copy the target's content to a RAM buffer instead of executing it
- Increment the bytecode pointer
- Repeat for all instructions in a basic block
- JSR to the RAM buffer, which will execute multiple instructions at once
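To make the copy-then-execute loop concrete, here is a rough sketch in Python rather than 6502 assembly. Everything in it is made up for illustration: the opcodes 0x01-0x03, the BLOCK_END marker standing in for jumps, calls and branches, and the handler functions standing in for chunks of machine code that would be copied from ROM to RAM.

```python
# Toy model: each handler is a list of "native" steps. On a real 6502 these
# would be machine-code bytes, without the final jump back to the interpreter.

def op_inc(state):
    state["acc"] += 1

def op_double(state):
    state["acc"] *= 2

def op_dec(state):
    state["acc"] -= 1

# Hypothetical "jump table": opcode -> handler body.
HANDLERS = {0x01: [op_inc], 0x02: [op_double], 0x03: [op_dec]}

# Hypothetical opcode standing in for any jump, call or branch.
BLOCK_END = {0xFF}

def build_block(bytecode, pc):
    """Copy handler bodies into a buffer until a control-flow opcode is hit."""
    buffer = []
    while pc < len(bytecode) and bytecode[pc] not in BLOCK_END:
        buffer.extend(HANDLERS[bytecode[pc]])  # copy instead of executing
        pc += 1                                # increment the bytecode pointer
    return buffer, pc

def run_block(buffer, state):
    """Stands in for the JSR to the RAM buffer: run all copied steps in a row."""
    for step in buffer:
        step(state)

state = {"acc": 3}
buffer, pc = build_block([0x01, 0x02, 0x03, 0xFF], 0)
run_block(buffer, state)
```

After the run, pc points at the 0xFF opcode, which would then be handled by the normal interpreter.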
The problem is with jump, call and branch instructions (hence the notion of a basic block). The immediate solution is to stop the loop I describe above at the first such instruction, and interpret it "normally".
Unfortunately, a basic block will be only 3-5 instructions long in most cases, which renders this whole idea useless.
So the other idea would be to include at least branches in the "basic block" part, so that only jumps and calls would need to exit the RAM buffer and return to the interpreter. The only way I can see to do this would be to make the size of every instruction's interpreter code directly proportional to the size of its bytecode (for example, 1-byte instructions take 8 bytes of interpreter code, 2-byte instructions take 16 bytes, and so on), so that the branch offsets can be shifted left a few times to become the "real" branch offsets.
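The arithmetic behind that scheme can be sketched in a few lines of Python. The factor of 8 is the example figure from the paragraph above; the function names and the handler lengths passed in are hypothetical.

```python
SLOT_SHIFT = 3  # assumed: 8 bytes of interpreter code per byte of bytecode

def native_offset(bytecode_offset):
    """Scale a bytecode branch offset into an offset inside the RAM buffer.

    Python's << on a negative int keeps the sign, so backward branches
    scale the same way as forward ones.
    """
    return bytecode_offset << SLOT_SHIFT

def padded_size(bytecode_len, handler_len):
    """Return (slot size, padding bytes) for one handler.

    Raises if the handler's code does not fit in the slot its bytecode
    length allows.
    """
    slot = bytecode_len << SLOT_SHIFT
    if handler_len > slot:
        raise ValueError("handler does not fit in its slot")
    return slot, slot - handler_len

forward = native_offset(5)     # branch 5 bytecode bytes ahead
backward = native_offset(-3)   # branch 3 bytecode bytes back
slot, padding = padded_size(1, 5)  # 1-byte instruction, 5 bytes of real code
```

The padding value is the number of filler bytes (NOPs, for instance) the handler would need so that the shift trick keeps working.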
This sounds like a cool idea, but it seriously limits what the interpreter can do: sometimes there will not be enough space for an instruction you'd like to implement, and sometimes you'll have to waste ROM space on NOPs (which also makes the interpreted code slower) in order to fulfill that condition.
So the only practical solution would be to know at compile time how many bytes each instruction takes, and account for them in the branches, so that the branch offsets refer to 6502 instructions and not to the bytecode itself. Needless to say, the portability and maintainability of the bytecode machine suffer seriously with this approach.
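As a rough illustration of what the compiler would have to do, here is a Python sketch that rewrites a bytecode branch offset into an offset between the copied handlers. The size tables are invented for the example, and for simplicity offsets are measured from the start of the branching instruction, which is not how a real 6502 relative branch counts.

```python
# Hypothetical sizes, in bytes, for three made-up opcodes.
BYTECODE_SIZE = {0x01: 1, 0x02: 2, 0x03: 1}   # size in the bytecode stream
HANDLER_SIZE = {0x01: 5, 0x02: 12, 0x03: 7}   # size of the 6502 handler body

def layout(opcodes):
    """Return {bytecode offset: native offset} for every instruction start."""
    table, bc, native = {}, 0, 0
    for op in opcodes:
        table[bc] = native
        bc += BYTECODE_SIZE[op]
        native += HANDLER_SIZE[op]
    table[bc] = native  # end of the block, so a branch may target it
    return table

def translate_branch(table, source_bc, bytecode_offset):
    """Convert a bytecode-relative offset into a native-relative one."""
    return table[source_bc + bytecode_offset] - table[source_bc]

table = layout([0x01, 0x02, 0x03, 0x01])
forward = translate_branch(table, 1, 3)    # from bytecode offset 1 to 4
backward = translate_branch(table, 4, -4)  # from bytecode offset 4 to 0
```

This also shows where the maintenance cost comes from: change the length of any handler and HANDLER_SIZE changes with it, so every compiled program has to be rebuilt.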