Do there exist 6502 -> PC compilers ???

H.G.Muller · Post by **H.G.Muller** » Thu Aug 09, 2007 9:03 pm

I am thinking ahead a little bit. Once I succeed in running my old (assembler-written) Chess program on the PC as a subroutine in a C program, really calling the 6502 emulator as a subroutine, I would be ready to make the next step.

The emulation will of course involve a large speed penalty, and for Chess programs speed is important. Of course a 2.4GHz Core 2 Duo will most likely still perform better than a 100 MHz 6502, but I am greedy, and would like to have the performance of a 3GHz 6502.

This can probably only be achieved by compilation, rather than emulation. Do there exist compilers that convert plain 6502 assembler into equivalent i386 machine code? With 'plain' I mean that it does not have to deal with self-modifying code. It just has to take a section of 6502 code that is not jumped into except at labels, and transform it to the fastest sequence of i386 instructions that achieves the same thing.

Does such a compiler exist?

kc5tja · Post by **kc5tja** » Sat Aug 11, 2007 2:37 am

Nope. But if you're willing, you can be the first. There is a tool called LLVM (see http://www.llvm.org for more information) which is a C++ library implementing an abstract syntax tree and code generating back-end. It's an excellent tool for compiler writers.

H.G.Muller · Post by **H.G.Muller** » Sun Aug 12, 2007 11:51 am

That was what I was afraid of. So I'll probably write my own.

I already looked into the problem, and it does not seem that hard. I don't think it would even have the complexity requiring usual compiler syntax trees and code generation from them. 6502 assembler and x86 assembler is nearly a 1:1 mapping. Except for some of the rarer addressing modes, for which you could simply substitute a group of x86 instructions calculating the address in an x86 register reserved for that purpose, followed by the regular translation of the instruction.

Basically every 6502 instruction has a group of x68 instructions doing the same thing, after mapping the 6502 A, X, Y, S and P on the x86 AL, BL, BH, CH and AH. (I won't emulate the PC, programs that retrieve a return address from the stack and start manipulating it, or in general do arithmetic on code addresses, fall outside the scope of this project.) Usually there are x68 that do exactly the same on the registers, and you could precceed them by the address calculation if the addressing mode is not available.

The effect on the flags for both machines is different enough to cause problems, though. Fully emulating P in AH, by suffixing every translation of a 6502 instruction by SETxx instructions testing the affected flags, and copying them to the appropriate bit of AH, would be very expensive, though. And mostly a waste of time! As almost any 6502 instruction sets the N and Z flag, and hardly any instruction uses it (only the appropriate branches, and instructions that explicitly reference P, such as PHP and BRK). Thus most flag settings are never used before they are overwritten again.

So I wanted to eliminate this overhead by, at compile time, determining which flag settings are used, and which aren't. For flags that are used, I would than check if the x86 equivalent of the intervening instructions would obliterate them on x86. (For C and V this will usually be the case, if there are intervening 6502 instructions between a CMP and BVC, as almost all x86 instructions affect C, while almost no 6502 instructions do. For N and V translation of 6502 addressing modes might destroy the flag settings on the x86.) Only if the flags would not survive to the point where they are used by an x86 instruction that uses it the same way (so branches, or ADC), or would be needed explicitly (PHP or BRK), I would include an explicit copying of the flag to the emulated flag bit in AH, and preceed the instruction using the flag by the appropriate test of these bits.

Similarly, I would keep both an 8-bit version of X and Y, and its zero-extended 32-bit versions X' and Y' for use in the absolute-indexed addressing modes. And normally not create the 32-bit copies unless the X or Y content was indeed used for addressing by a later instruction. Then the zp, abs, abs,X and abs,Y modes can all be translated by the equivalent x86 mode. And (zp),Y only needs a MOV instruction for the indirection (conserving all x86 flags). Only zp,X and zp,Y are a pain, due to the wrap-around. These often-used modes thus need an explicit addition in an 8-bit register, or LEA plus ANDL #FFh in a 32-bit register, both sequences destroying the flags. Same for (zp,X), but fortunately this mode is seldomly used.

What annoys me most is that the zero-paged index modes need a very cumbersome treatment because of the wrap-around, while in practice the wrap-around feature will be hardly ever used. So I might include an option where these modes are simply treated as if they were absolute indexed modes. To really do this well on x86 hardware, you would have to define the zero-page as a separate segment (aliased with a larger segment) of only 256 bytes size, and translate the indexed zero-page modes using that segment (and normal x86 32-bit indexed addressing). Then any access that would need wrap-around would be trapped as a segmentation violation. You could than make the trap handler emulate the wrap-around, and you would suffer zero overhead if wrap-around did never occur.

fachat · Post by **fachat** » Mon Aug 13, 2007 9:38 am

H.G.Muller wrote:

That was what I was afraid of. So I'll probably write my own.

I'm looking forward to the results :)

Quote:

Basically every 6502 instruction has a group of x68 instructions doing the same thing, after mapping the 6502 A, X, Y, S and P on the x86 AL, BL, BH, CH and AH. (I won't emulate the PC, programs that retrieve a return address from the stack and start manipulating it, or in general do arithmetic on code addresses, fall outside the scope of this project.) Usually there are x68 that do exactly the same on the registers, and you could precceed them by the address calculation if the addressing mode is not available.

I think you have to handle address arithmetic in a way that at least address tables and (address-1) tables are handled. The former where it's loaded into a point for JMP(ABS) and the latter, even more important I think, for pushing it on the stack, for doing an RTS to it.
This is a common programming pattern

André

H.G.Muller · Post by **H.G.Muller** » Mon Aug 13, 2007 12:24 pm

OK, thanks for the advice. Tables of addresses for JMP (abs) is not really what I would consider 'address arithmetic'. In a 'clean' program the table would be initialized with a list of labels in the code segment, and it would work even if the instruction translations would have different length as they would have in true 6502 machine code.

The other case you mention is more tricky. It would require the compiler to know when the top of the stack is an address pushed there by a JSR (and which one?), and take special action if an RTS occurs where the top-of-stack is not an address pushed earlier. I guess the special action would be popping the address and use it in an indirect x86 jump to the address+1. This to avoid messing up the return stack of the x86 branch-prediction unit.

IMO this construct of pushing addresses and using RTS to jump to them borders dangerously close to 'dirty code'. Because there is a much more straightforward alternative of the JMP (abs). But if you say it is common 6502 idiom, I'll take your word for it.

To solve the problem, the compiler would then have to do an emulation of the stack contents (address vs data) for each possible path of control flow. I.e., remember for each stacked byte which instruction stacked it (PHA/PHP vs JSR low or high address byte). And if it discovers a discrepancy at points of confluence, it should mark all the items where the stack differed as 'undefined'. Programs are then defined as 'clean' when they only use well-defined stack items in RTS instructions (and refrain from writing addresses into the stack by indexed modes). All very cumbersome, but doable.

kc5tja · Post by **kc5tja** » Tue Aug 14, 2007 2:57 pm

Quote:

What annoys me most is that the zero-paged index modes need a very cumbersome treatment because of the wrap-around, while in practice the wrap-around feature will be hardly ever used.

Is it possible to statically analyze the code being translated to see if $FF appears as a direct-page address anywhere?

BTW, I think the 65816 does away completely with the wrap-around, at least in native-mode.

Quote:

any access that would need wrap-around would be trapped as a segmentation violation. You could than make the trap handler emulate the wrap-around, and you would suffer zero overhead if wrap-around did never occur.

You could also use paging for this. Simply map the direct-page to the last 256 bytes of a page that has a guard page after it. It'll consume 8KB of (virtual) memory, but it'll catch any attempts to access memory in the guard page via a page fault.

kc5tja · Post by **kc5tja** » Tue Aug 14, 2007 3:17 pm

Quote:

The other case you mention is more tricky. It would require the compiler to know when the top of the stack is an address pushed there by a JSR (and which one?)

JSR can be mapped to three x86 instructions:

Code: Select all

  push address_minus_one
  jmp address

This should be quite zippy. While not AS fast as a real, honest-to-goodness subroutine call in x86-land (1.5 to 3 cycles), it's still (I think) a lot better than what you're thinking. RTS is similarly coded:

Code: Select all

  pop eax
  inc eax
  jmp eax

What do you mean by, "and which one?"

Quote:

IMO this construct of pushing addresses and using RTS to jump to them borders dangerously close to 'dirty code'. Because there is a much more straightforward alternative of the JMP (abs).

The practice of using RTS is much faster, and is also re-entrant. Using JMP (abs) requires a global variable, which might cause problems, particularly when/if an interrupt handler happens to do an indirect jump and trashes your pointer.

Quote:

To solve the problem, the compiler would then have to do an emulation of the stack contents (address vs data) for each possible path of control flow. I.e., remember for each stacked byte which instruction stacked it (PHA/PHP vs JSR low or high address byte). And if it discovers a discrepancy at points of confluence, it should mark all the items where the stack differed as 'undefined'. Programs are then defined as 'clean' when they only use well-defined stack items in RTS instructions (and refrain from writing addresses into the stack by indexed modes). All very cumbersome, but doable.

A data flow analysis is also possible, and with static single assignment occuring at the optimization layer, I think preferable. For example, if you modeled the carry flag, negative flag, etc. as separate boolean values and threaded their states correctly, then any good SSA optimizer would dispense entirely with the code that manages unused flags.

This is why I recommended looking at LLVM, as it abstracts the entirety of a compiler back-end in a nice, easy-to-use C++ API. The only thing it probably wouldn't be able to handle is self-modifying code, but it looks as though you're not targeting that aspect. (Which, BTW, is another semi-common practice on the 6502 architecture in particular, again due to the lack of more general purpose instructions/addressing modes.)

Although SSA is sophisticated enough to model CPU flags on their own level, few compiler back-ends exploit this, so LLVM probably won't be able to exploit carry-over either. However, I am looking at the possibility of engineering a functional programming language for the 65816, which does have an optimizer that is "flags aware." But, it's vaporware for the time being, so don't wait on me.

dclxvi · Post by **dclxvi** » Wed Aug 15, 2007 4:08 am

kc5tja wrote:

Is it possible to statically analyze the code being translated to see if $FF appears as a direct-page address anywhere?

Um, the issue here is not:

LDA ($FF),Y

which is extremely rare to nonexistent (I can't think of a single program with an LDA ($FF),Y instruction), but zp wrap around like (from http://6502.org/source/floats/wozfp1.txt):

Code: Select all

1FB0  A2 FD            LDX =$FD    INDEX FOR 3-BYTE CONDITIONAL MOVE
1FB2  68        DIV3   PLA         PULL A BYTE OF DIFFERENCE OFF STACK
1FB3  90 02            BCC DIV4    IF MANT2<E THEN DONT RESTORE MANT2
1FB5  95 08            STA M2+3,X
1FB7  E8        DIV4   INX         NEXT LESS SIGNIF BYTE
1FB8  D0 F8            BNE DIV3    LOOP UNTIL DONE

Since M2 (the zp address) is $05, and STA M2+3,X assembles to 95 08, that, in and of itself, is not going to be much of a hint that wrap around will occur; similar code starts at $F4C5 in http://6502.org/source/floats/wozfp3.txt. Likewise, in 64-bit or 128-bit division routines, where the comparison is high byte to low byte and the subtraction is done low byte to high byte, X (when used for indexing) will be incremented for one operation and decremented for the other, whether the bytes are stored in memory low byte first or high byte first.

A much less common case would be an ($FE,X) operand (likely near a (0,X) operand), which wraps around unless X is zero.

kc5tja wrote:

BTW, I think the 65816 does away completely with the wrap-around, at least in native-mode.

Yes. The 65816 has no page wrap around at all in native mode, but there is bank wrap around, e.g. when X is $FFFF, LDA 1,X (B5 01) reads address $000000 (i.e. bank 0). (Absolute addressing can span a bank boundary, though; LDA $0001,X (BD 01 00) reads address $010000, i.e. bank 1., when X is $FFFF) In emulation mode, there is stack, zp, etc. page wrap around, of course.

kc5tja wrote:

The practice of using RTS is much faster, and is also re-entrant. Using JMP (abs) requires a global variable, which might cause problems, particularly when/if an interrupt handler happens to do an indirect jump and trashes your pointer.

In addition to what has already been mentioned, another reason why

LDA TABLE_HI,X
PHA
LDA TABLE_LO,X
PHA
RTS

is/was used (and yes, it's quite common) is that the code is shorter (less space) than

LDA TABLE_HI,X
STA DEST+1
LDA TABLE_LO,X
STA DEST
JMP (DEST)

which was once an very important factor. However, when DEST is on the zero page, the latter takes one fewer cycle on the NMOS 6502 (they both take the same number of cycles on the 65C02). (Both forms have been used; Apple II Integer Basic used the latter with DEST on the zero page, but Sweet 16 (in the repository) uses the former, and both had the same author). Also, JMP (LABEL) has a bug on the NMOS 6502 when LABEL is $XXFF, which has resulted in the use of JMP (abs) being discouraged.

The 65C02 has a JMP (abs,X) instruction which largely eliminates the need for RTS-style jump tables, but a lot of software is/was written so as not to exclude the NMOS 6502 whenever feasible.

BigEd · Post by **BigEd** » Sat Aug 14, 2010 3:56 pm

H.G.Muller wrote:

Do there exist compilers that convert plain 6502 assembler into equivalent i386 machine code? With 'plain' I mean that it does not have to deal with self-modifying code.

I already looked into the problem, and it does not seem that hard. I don't think it would even have the complexity requiring usual compiler syntax trees and code generation from them. 6502 assembler and x86 assembler is nearly a 1:1 mapping. Except for some of the rarer addressing modes, for which you could simply substitute a group of x86 instructions calculating the address in an x86 register reserved for that purpose, followed by the regular translation of the instruction.

I just posted this which mentions Bill Forster's approach - he uses a macro for each opcode. With luck a good compiler would be able to optimise things like flags which are not checked before the next time they are updated.

See also Sarah Walker's RPCemu, which emulates an ARM-based computer, and which has an optional mode where it translates direct to x86 machine code. She reports a 3x to 4x speed benefit.

I've been looking recently at the performance possible by emulating 6502 on ARM: see this thread.

Cheers
Ed