I regard x86 as unaligned 16 bit instructions mostly implemented with a suffix byte for addressing mode, five sets of short instructions primarily to get source compatible 8080 assembly down to a sensible size and an increasing miscellany of bad ideas. 8087? VEX/REX prefix? Intel CET? Who thought any of this was a good idea?
The 16 bit instructions are the major concern. The obvious implementation strategy is a macro or subroutine call which performs its own bytecode instruction fetch and returns the computed address. However, as I understand it, that doesn't work uniformly because some instructions have exception cases, so you will require multiple implementations or work-arounds. Have you considered 68000 on 6502? This might be an easier task.
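To make the subroutine idea concrete, here is a rough sketch in Python (chosen only for brevity; a real interpreter would be 6502/65816 assembly). It fetches the 8086 ModRM byte plus any displacement and returns the computed offset. The names `decode_modrm`, `BASE_INDEX` and the `regs` dictionary are made up for illustration. The one exception case shown is the one I am sure of: mod=00 with r/m=110 means a bare 16 bit displacement rather than [BP]. Segment defaults and instructions which reuse the reg field as an opcode extension are not handled here.

```python
# Hypothetical helper: base/index register pair selected by the r/m field.
BASE_INDEX = [
    ("bx", "si"), ("bx", "di"), ("bp", "si"), ("bp", "di"),
    ("si", None), ("di", None), ("bp", None), ("bx", None),
]

def decode_modrm(mem, ip, regs):
    """Fetch a ModRM byte at ip and return (mod, reg, ea, next_ip).

    ea is the 16 bit effective offset, or None when mod == 3
    (register operand, so there is no memory address).
    """
    modrm = mem[ip]
    ip += 1
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod == 3:
        return mod, reg, None, ip
    if mod == 0 and rm == 6:
        # Exception case: bare 16 bit displacement, not [BP].
        ea = mem[ip] | (mem[ip + 1] << 8)
        return mod, reg, ea, ip + 2
    base, index = BASE_INDEX[rm]
    ea = regs[base] + (regs[index] if index else 0)
    if mod == 1:
        disp = mem[ip]
        ea += disp - 256 if disp & 0x80 else disp  # sign-extended disp8
        ip += 1
    elif mod == 2:
        ea += mem[ip] | (mem[ip + 1] << 8)  # disp16, little-endian
        ip += 2
    return mod, reg, ea & 0xFFFF, ip
```

The caller still has to decide what to do with the reg field, which is where per-instruction work-arounds would creep in.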
For prefix bytes, cascade into a separate tree of instruction decode. This requires one root table of 256 cases plus one additional table for every prefix. The tables may be heavily shared and cross-linked. For example, instructions where a prefix has no effect will be referenced in multiple places. Redundant prefix sequences within the instruction stream require self-reference and cross-reference.
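A minimal sketch of the cascaded tables, again in Python and with invented handler names. It only covers three real 8086 opcodes (0x90 NOP, 0xA4 MOVSB, 0xF3 REP) and assumes that REP has no effect on NOP. An entry is either a handler or another table; the REP table points back at itself to absorb redundant prefixes.

```python
def undefined(state):
    raise NotImplementedError("opcode not covered by this sketch")

def nop(state):
    state["trace"].append("nop")

def movsb(state):
    state["trace"].append("movsb")

def rep_movsb(state):
    state["trace"].append("rep movsb")

# Root table: 256 cases. One extra table for the REP prefix.
root = [undefined] * 256
rep = [undefined] * 256

root[0x90] = nop
root[0xA4] = movsb
root[0xF3] = rep        # prefix: cascade into the REP table

rep[0x90] = nop         # prefix has no effect: same handler, referenced twice
rep[0xA4] = rep_movsb   # prefix changes behaviour: separate handler
rep[0xF3] = rep         # redundant prefix: self-reference

def step(mem, ip, state):
    """Decode and run one instruction; return the next ip."""
    table = root
    while True:
        entry = table[mem[ip]]
        ip += 1
        if isinstance(entry, list):
            table = entry   # a prefix byte selected another table
        else:
            entry(state)
            return ip
```

With several prefixes, each table would also need cross-references into the others, which is where the table count and the linking effort grow.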
Handling the mis-matched flags is awkward. Possibly the best strategy is to hold Z flag on native stack and unconditionally compute other flags even if they will be immediately discarded. Specific cases can be handled as if they were instruction prefixes. However, there is a combinatorial explosion of partial matches which have to be handled. Also, detection of specific cases is very sensitive to programming style.
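For reference, this is roughly what "unconditionally compute other flags" means for an 8 bit ADD. It is a Python sketch with a made-up function name, not anyone's actual implementation. The bit positions are the 8086 FLAGS layout. The 6502 status register has no parity or auxiliary carry flag, so at least those two always need explicit work.

```python
# 8086 FLAGS bit positions.
CF, PF, AF, ZF, SF, OF = 0x001, 0x004, 0x010, 0x040, 0x080, 0x800

def add8_flags(a, b):
    """Return (result, flags) for an 8 bit ADD of a and b."""
    total = a + b
    result = total & 0xFF
    flags = 0
    if total > 0xFF:
        flags |= CF                       # carry out of bit 7
    if bin(result).count("1") % 2 == 0:
        flags |= PF                       # even parity of the low byte
    if (a ^ b ^ result) & 0x10:
        flags |= AF                       # carry out of bit 3
    if result == 0:
        flags |= ZF
    if result & 0x80:
        flags |= SF
    if (a ^ result) & (b ^ result) & 0x80:
        flags |= OF                       # signed overflow
    return result, flags
```

For example, `add8_flags(0xFF, 1)` gives a zero result with CF, PF, AF and ZF set, whereas `add8_flags(0x7F, 1)` sets AF, SF and OF.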
Martin A on Fri 11 Mar 2022 wrote:
256 bytes per opcode. ... Each routine then ends with a long jump back to the opcode fetch.
If you have oodles of space, don't branch back then branch forward again. Place one (or more) copies of the bytecode instruction fetch within each page. Yes, the majority of your bytecode interpreter will be redundant code but it compresses nicely and you weren't doing anything else with the space.
xlar54 on Fri 11 Mar 2022 wrote:
Im working on a simple 8086 emulator for the 65816
xlar54 targets specific 65816 hardware but I wonder if 8086 simulation on 6502/65816 would benefit from a dedicated 1MB RAM with hardware acceleration for segment address calculation. This could be achieved with one 74HC138, two latches for address, two latches for segment, one read strobe and one write strobe. Four 4 bit adders can be used to calculate the address and no look-ahead hardware is required. This eliminates many of the circumstances where 4 bit shifts are required. It ensures that the bytecode interpreter is outside of the address-space of the guest environment and it has no adverse effect on I/O segment handling.
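A software model of the adder chain, as a sanity check on the count of four. This is a Python sketch with invented names and it assumes plain ripple carry between 4 bit adders, with the final carry dropped so that addresses wrap at 1MB. The low nibble of the offset bypasses the adders entirely.

```python
def add4(a, b, carry_in):
    """One 4 bit adder: return (sum nibble, carry out)."""
    total = a + b + carry_in
    return total & 0xF, total >> 4

def physical_address(segment, offset):
    """20 bit address from 16 bit segment and offset, via four 4 bit adders."""
    address = offset & 0xF          # bits 0-3 pass straight through
    carry = 0
    for stage in range(4):
        seg_nibble = (segment >> (4 * stage)) & 0xF
        # Offset bits 4-15 feed the first three stages; the fourth gets zero.
        off_nibble = (offset >> (4 * stage + 4)) & 0xF
        nibble, carry = add4(seg_nibble, off_nibble, carry)
        address |= nibble << (4 * stage + 4)
    return address                  # final carry dropped: wraps at 1MB
```

This matches `((segment << 4) + offset) & 0xFFFFF`, so the interpreter would only need to write segment and offset to the latches and then strobe.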
The two unused strobes could be used for a hypothetical x86 extension. After all of the time-wasting between 8086 and 80386 (and further time-wasting between 80386 and 80586), many people have suggested that 8086 with 8 bit pre-scale on segment registers would be highly desirable. This would allow simple binaries to run unchanged within a larger address-space. It also simplifies address comparison. I believe that Intel's official macro definition for 20 bit address comparison is 44 bytes. That's ridiculous. With an 8 bit pre-scale, this could be considerably simplified. For dedicated hardware into 16MB RAM, this requires one 74HC138, four latches, five 4 bit adders, five 4 bit multiplexers and one bit of state for pre-scale mode. A bytecode interpreter requires minor modification to set mode but would otherwise run without modification. In particular, it would execute without performance penalty in either mode.
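The two modes differ only in the shift applied to the segment, which is what the multiplexers would select. A short Python sketch with a hypothetical function name, assuming the legacy mode wraps at 1MB and the pre-scale mode wraps at 16MB:

```python
def prescaled_address(segment, offset, prescale8=False):
    """Physical address with a 4 bit (legacy) or 8 bit segment pre-scale."""
    shift = 8 if prescale8 else 4
    mask = 0xFFFFFF if prescale8 else 0xFFFFF   # 16MB or 1MB wrap
    return ((segment << shift) + offset) & mask
```

For example, segment 0x1234 with offset 0x5678 maps to 0x179B8 in legacy mode and 0x128A78 with the 8 bit pre-scale. Offsets within a segment behave identically in both modes, which is why a simple binary would not notice the difference.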