I am thinking ahead a little bit. Once I succeed in running my old (assembler-written) Chess program on the PC as a subroutine in a C program, really calling the 6502 emulator as a subroutine, I would be ready to make the next step.
The emulation will of course involve a large speed penalty, and for Chess programs speed is important. Of course a 2.4GHz Core 2 Duo will most likely still perform better than a 100 MHz 6502, but I am greedy, and would like to have the performance of a 3GHz 6502.
This can probably only be achieved by compilation, rather than emulation. Do there exist compilers that convert plain 6502 assembler into equivalent i386 machine code? With 'plain' I mean that it does not have to deal with self-modifying code. It just has to take a section of 6502 code that is not jumped into except at labels, and transform it to the fastest sequence of i386 instructions that achieves the same thing.
Does such a compiler exist?
Do there exist 6502 -> PC compilers ???
-
H.G.Muller
- Posts: 7
- Joined: 26 Jul 2007
- Location: Amsterdam
- Contact:
Nope. But if you're willing, you can be the first. There is a tool called LLVM (see http://www.llvm.org for more information) which is a C++ library implementing an abstract syntax tree and code generating back-end. It's an excellent tool for compiler writers.
-
H.G.Muller
- Posts: 7
- Joined: 26 Jul 2007
- Location: Amsterdam
- Contact:
That was what I was afraid of. So I'll probably write my own.
I already looked into the problem, and it does not seem that hard. I don't think it would even have the complexity requiring usual compiler syntax trees and code generation from them. 6502 assembler and x86 assembler is nearly a 1:1 mapping. Except for some of the rarer addressing modes, for which you could simply substitute a group of x86 instructions calculating the address in an x86 register reserved for that purpose, followed by the regular translation of the instruction.
Basically every 6502 instruction has a group of x68 instructions doing the same thing, after mapping the 6502 A, X, Y, S and P on the x86 AL, BL, BH, CH and AH. (I won't emulate the PC, programs that retrieve a return address from the stack and start manipulating it, or in general do arithmetic on code addresses, fall outside the scope of this project.) Usually there are x68 that do exactly the same on the registers, and you could precceed them by the address calculation if the addressing mode is not available.
The effect on the flags for both machines is different enough to cause problems, though. Fully emulating P in AH, by suffixing every translation of a 6502 instruction by SETxx instructions testing the affected flags, and copying them to the appropriate bit of AH, would be very expensive, though. And mostly a waste of time! As almost any 6502 instruction sets the N and Z flag, and hardly any instruction uses it (only the appropriate branches, and instructions that explicitly reference P, such as PHP and BRK). Thus most flag settings are never used before they are overwritten again.
So I wanted to eliminate this overhead by, at compile time, determining which flag settings are used, and which aren't. For flags that are used, I would than check if the x86 equivalent of the intervening instructions would obliterate them on x86. (For C and V this will usually be the case, if there are intervening 6502 instructions between a CMP and BVC, as almost all x86 instructions affect C, while almost no 6502 instructions do. For N and V translation of 6502 addressing modes might destroy the flag settings on the x86.) Only if the flags would not survive to the point where they are used by an x86 instruction that uses it the same way (so branches, or ADC), or would be needed explicitly (PHP or BRK), I would include an explicit copying of the flag to the emulated flag bit in AH, and preceed the instruction using the flag by the appropriate test of these bits.
Similarly, I would keep both an 8-bit version of X and Y, and its zero-extended 32-bit versions X' and Y' for use in the absolute-indexed addressing modes. And normally not create the 32-bit copies unless the X or Y content was indeed used for addressing by a later instruction. Then the zp, abs, abs,X and abs,Y modes can all be translated by the equivalent x86 mode. And (zp),Y only needs a MOV instruction for the indirection (conserving all x86 flags). Only zp,X and zp,Y are a pain, due to the wrap-around. These often-used modes thus need an explicit addition in an 8-bit register, or LEA plus ANDL #FFh in a 32-bit register, both sequences destroying the flags. Same for (zp,X), but fortunately this mode is seldomly used.
What annoys me most is that the zero-paged index modes need a very cumbersome treatment because of the wrap-around, while in practice the wrap-around feature will be hardly ever used. So I might include an option where these modes are simply treated as if they were absolute indexed modes. To really do this well on x86 hardware, you would have to define the zero-page as a separate segment (aliased with a larger segment) of only 256 bytes size, and translate the indexed zero-page modes using that segment (and normal x86 32-bit indexed addressing). Then any access that would need wrap-around would be trapped as a segmentation violation. You could than make the trap handler emulate the wrap-around, and you would suffer zero overhead if wrap-around did never occur.
I already looked into the problem, and it does not seem that hard. I don't think it would even have the complexity requiring usual compiler syntax trees and code generation from them. 6502 assembler and x86 assembler is nearly a 1:1 mapping. Except for some of the rarer addressing modes, for which you could simply substitute a group of x86 instructions calculating the address in an x86 register reserved for that purpose, followed by the regular translation of the instruction.
Basically every 6502 instruction has a group of x68 instructions doing the same thing, after mapping the 6502 A, X, Y, S and P on the x86 AL, BL, BH, CH and AH. (I won't emulate the PC, programs that retrieve a return address from the stack and start manipulating it, or in general do arithmetic on code addresses, fall outside the scope of this project.) Usually there are x68 that do exactly the same on the registers, and you could precceed them by the address calculation if the addressing mode is not available.
The effect on the flags for both machines is different enough to cause problems, though. Fully emulating P in AH, by suffixing every translation of a 6502 instruction by SETxx instructions testing the affected flags, and copying them to the appropriate bit of AH, would be very expensive, though. And mostly a waste of time! As almost any 6502 instruction sets the N and Z flag, and hardly any instruction uses it (only the appropriate branches, and instructions that explicitly reference P, such as PHP and BRK). Thus most flag settings are never used before they are overwritten again.
So I wanted to eliminate this overhead by, at compile time, determining which flag settings are used, and which aren't. For flags that are used, I would than check if the x86 equivalent of the intervening instructions would obliterate them on x86. (For C and V this will usually be the case, if there are intervening 6502 instructions between a CMP and BVC, as almost all x86 instructions affect C, while almost no 6502 instructions do. For N and V translation of 6502 addressing modes might destroy the flag settings on the x86.) Only if the flags would not survive to the point where they are used by an x86 instruction that uses it the same way (so branches, or ADC), or would be needed explicitly (PHP or BRK), I would include an explicit copying of the flag to the emulated flag bit in AH, and preceed the instruction using the flag by the appropriate test of these bits.
Similarly, I would keep both an 8-bit version of X and Y, and its zero-extended 32-bit versions X' and Y' for use in the absolute-indexed addressing modes. And normally not create the 32-bit copies unless the X or Y content was indeed used for addressing by a later instruction. Then the zp, abs, abs,X and abs,Y modes can all be translated by the equivalent x86 mode. And (zp),Y only needs a MOV instruction for the indirection (conserving all x86 flags). Only zp,X and zp,Y are a pain, due to the wrap-around. These often-used modes thus need an explicit addition in an 8-bit register, or LEA plus ANDL #FFh in a 32-bit register, both sequences destroying the flags. Same for (zp,X), but fortunately this mode is seldomly used.
What annoys me most is that the zero-paged index modes need a very cumbersome treatment because of the wrap-around, while in practice the wrap-around feature will be hardly ever used. So I might include an option where these modes are simply treated as if they were absolute indexed modes. To really do this well on x86 hardware, you would have to define the zero-page as a separate segment (aliased with a larger segment) of only 256 bytes size, and translate the indexed zero-page modes using that segment (and normal x86 32-bit indexed addressing). Then any access that would need wrap-around would be trapped as a segmentation violation. You could than make the trap handler emulate the wrap-around, and you would suffer zero overhead if wrap-around did never occur.
H.G.Muller wrote:
That was what I was afraid of. So I'll probably write my own.
Quote:
Basically every 6502 instruction has a group of x68 instructions doing the same thing, after mapping the 6502 A, X, Y, S and P on the x86 AL, BL, BH, CH and AH. (I won't emulate the PC, programs that retrieve a return address from the stack and start manipulating it, or in general do arithmetic on code addresses, fall outside the scope of this project.) Usually there are x68 that do exactly the same on the registers, and you could precceed them by the address calculation if the addressing mode is not available.
This is a common programming pattern
André
-
H.G.Muller
- Posts: 7
- Joined: 26 Jul 2007
- Location: Amsterdam
- Contact:
OK, thanks for the advice. Tables of addresses for JMP (abs) is not really what I would consider 'address arithmetic'. In a 'clean' program the table would be initialized with a list of labels in the code segment, and it would work even if the instruction translations would have different length as they would have in true 6502 machine code.
The other case you mention is more tricky. It would require the compiler to know when the top of the stack is an address pushed there by a JSR (and which one?), and take special action if an RTS occurs where the top-of-stack is not an address pushed earlier. I guess the special action would be popping the address and use it in an indirect x86 jump to the address+1. This to avoid messing up the return stack of the x86 branch-prediction unit.
IMO this construct of pushing addresses and using RTS to jump to them borders dangerously close to 'dirty code'. Because there is a much more straightforward alternative of the JMP (abs). But if you say it is common 6502 idiom, I'll take your word for it.
To solve the problem, the compiler would then have to do an emulation of the stack contents (address vs data) for each possible path of control flow. I.e., remember for each stacked byte which instruction stacked it (PHA/PHP vs JSR low or high address byte). And if it discovers a discrepancy at points of confluence, it should mark all the items where the stack differed as 'undefined'. Programs are then defined as 'clean' when they only use well-defined stack items in RTS instructions (and refrain from writing addresses into the stack by indexed modes). All very cumbersome, but doable.
The other case you mention is more tricky. It would require the compiler to know when the top of the stack is an address pushed there by a JSR (and which one?), and take special action if an RTS occurs where the top-of-stack is not an address pushed earlier. I guess the special action would be popping the address and use it in an indirect x86 jump to the address+1. This to avoid messing up the return stack of the x86 branch-prediction unit.
IMO this construct of pushing addresses and using RTS to jump to them borders dangerously close to 'dirty code'. Because there is a much more straightforward alternative of the JMP (abs). But if you say it is common 6502 idiom, I'll take your word for it.
To solve the problem, the compiler would then have to do an emulation of the stack contents (address vs data) for each possible path of control flow. I.e., remember for each stacked byte which instruction stacked it (PHA/PHP vs JSR low or high address byte). And if it discovers a discrepancy at points of confluence, it should mark all the items where the stack differed as 'undefined'. Programs are then defined as 'clean' when they only use well-defined stack items in RTS instructions (and refrain from writing addresses into the stack by indexed modes). All very cumbersome, but doable.
Quote:
What annoys me most is that the zero-paged index modes need a very cumbersome treatment because of the wrap-around, while in practice the wrap-around feature will be hardly ever used.
BTW, I think the 65816 does away completely with the wrap-around, at least in native-mode.
Quote:
any access that would need wrap-around would be trapped as a segmentation violation. You could than make the trap handler emulate the wrap-around, and you would suffer zero overhead if wrap-around did never occur.
Quote:
The other case you mention is more tricky. It would require the compiler to know when the top of the stack is an address pushed there by a JSR (and which one?)
Code: Select all
push address_minus_one
jmp address
Code: Select all
pop eax
inc eax
jmp eax
Quote:
IMO this construct of pushing addresses and using RTS to jump to them borders dangerously close to 'dirty code'. Because there is a much more straightforward alternative of the JMP (abs).
Quote:
To solve the problem, the compiler would then have to do an emulation of the stack contents (address vs data) for each possible path of control flow. I.e., remember for each stacked byte which instruction stacked it (PHA/PHP vs JSR low or high address byte). And if it discovers a discrepancy at points of confluence, it should mark all the items where the stack differed as 'undefined'. Programs are then defined as 'clean' when they only use well-defined stack items in RTS instructions (and refrain from writing addresses into the stack by indexed modes). All very cumbersome, but doable.
This is why I recommended looking at LLVM, as it abstracts the entirety of a compiler back-end in a nice, easy-to-use C++ API. The only thing it probably wouldn't be able to handle is self-modifying code, but it looks as though you're not targeting that aspect. (Which, BTW, is another semi-common practice on the 6502 architecture in particular, again due to the lack of more general purpose instructions/addressing modes.)
Although SSA is sophisticated enough to model CPU flags on their own level, few compiler back-ends exploit this, so LLVM probably won't be able to exploit carry-over either. However, I am looking at the possibility of engineering a functional programming language for the 65816, which does have an optimizer that is "flags aware." But, it's vaporware for the time being, so don't wait on me.
kc5tja wrote:
Is it possible to statically analyze the code being translated to see if $FF appears as a direct-page address anywhere?
LDA ($FF),Y
which is extremely rare to nonexistent (I can't think of a single program with an LDA ($FF),Y instruction), but zp wrap around like (from http://6502.org/source/floats/wozfp1.txt):
Code: Select all
1FB0 A2 FD LDX =$FD INDEX FOR 3-BYTE CONDITIONAL MOVE
1FB2 68 DIV3 PLA PULL A BYTE OF DIFFERENCE OFF STACK
1FB3 90 02 BCC DIV4 IF MANT2<E THEN DONT RESTORE MANT2
1FB5 95 08 STA M2+3,X
1FB7 E8 DIV4 INX NEXT LESS SIGNIF BYTE
1FB8 D0 F8 BNE DIV3 LOOP UNTIL DONE
A much less common case would be an ($FE,X) operand (likely near a (0,X) operand), which wraps around unless X is zero.
kc5tja wrote:
BTW, I think the 65816 does away completely with the wrap-around, at least in native-mode.
kc5tja wrote:
The practice of using RTS is much faster, and is also re-entrant. Using JMP (abs) requires a global variable, which might cause problems, particularly when/if an interrupt handler happens to do an indirect jump and trashes your pointer.
LDA TABLE_HI,X
PHA
LDA TABLE_LO,X
PHA
RTS
is/was used (and yes, it's quite common) is that the code is shorter (less space) than
LDA TABLE_HI,X
STA DEST+1
LDA TABLE_LO,X
STA DEST
JMP (DEST)
which was once an very important factor. However, when DEST is on the zero page, the latter takes one fewer cycle on the NMOS 6502 (they both take the same number of cycles on the 65C02). (Both forms have been used; Apple II Integer Basic used the latter with DEST on the zero page, but Sweet 16 (in the repository) uses the former, and both had the same author). Also, JMP (LABEL) has a bug on the NMOS 6502 when LABEL is $XXFF, which has resulted in the use of JMP (abs) being discouraged.
The 65C02 has a JMP (abs,X) instruction which largely eliminates the need for RTS-style jump tables, but a lot of software is/was written so as not to exclude the NMOS 6502 whenever feasible.
H.G.Muller wrote:
Do there exist compilers that convert plain 6502 assembler into equivalent i386 machine code? With 'plain' I mean that it does not have to deal with self-modifying code.
I already looked into the problem, and it does not seem that hard. I don't think it would even have the complexity requiring usual compiler syntax trees and code generation from them. 6502 assembler and x86 assembler is nearly a 1:1 mapping. Except for some of the rarer addressing modes, for which you could simply substitute a group of x86 instructions calculating the address in an x86 register reserved for that purpose, followed by the regular translation of the instruction.
I already looked into the problem, and it does not seem that hard. I don't think it would even have the complexity requiring usual compiler syntax trees and code generation from them. 6502 assembler and x86 assembler is nearly a 1:1 mapping. Except for some of the rarer addressing modes, for which you could simply substitute a group of x86 instructions calculating the address in an x86 register reserved for that purpose, followed by the regular translation of the instruction.
See also Sarah Walker's RPCemu, which emulates an ARM-based computer, and which has an optional mode where it translates direct to x86 machine code. She reports a 3x to 4x speed benefit.
I've been looking recently at the performance possible by emulating 6502 on ARM: see this thread.
Cheers
Ed