Zero page as registers or globals?

65LUN02 · Post by **65LUN02** » Fri Aug 26, 2022 4:14 am

BillO wrote:

I have to read this entire thread, but it seems like you want to implement a virtual machine in the new assembler over and above the native 6502 capabilities WRT registers. Am I close?

Sort of. My particular testbed is a the iz6502 emulator, which is written in Go, and my own assembler+ also written in Go, both underlying the izapple2 emulator. With all that, I’m able in minutes to add new instructions to the CPU and assembler, and have them running on a command line.

What I found from that toolchain is that it is too easy to get from idea to opcode. Too easy as there are literally just minutes e.g. between thinking AXY (A = X + Y) is a good idea and having axy in a working program.

It makes for a very fast turn around between having an idea and seeing it through but it’s not enough time to know whether you are wasting precious opcode space on a frivolous operation. The beauty of the 6502 is so little frivolity. The NMOS chip didn’t even have INC or DEC!

That said, what intrigues me from this thread is the idea of building a language a half step up between assembly and your favorite higher language (mine being C). So no, not a VM but instead a new set of mnemonics and syntax that translate down to “traditional” 6592 assembly, but at that medium-level language have 32 or 64 or 128 registers plus X and Y instead of A, X, and Y. Leave A as a workspace register, just like the DR and B registers that exist but are not directly addressable, and what happens? Do we just get a slow 6502 or do we get something more capable?

In terms of SWEET16, my testbed CPU has 24-bit addresses and 8/16/24 bit registers, but that’s a discussion for another thread (see 65C2402).

Over here my original question was basically, “why don’t we treat zp like 255 registers?” and the answer seems to be, because the assembly mnemonics make zp look and feel like memory, not registers. That and (zp),y is useful given 8-bit registers but 16-bit addresses.

65LUN02 · Post by **65LUN02** » Fri Aug 26, 2022 4:26 am

Proxy wrote:

LDR0 $1234 loads a byte from Address $1234 into R0
LDW0 $1234 loads 2 bytes, one from Address $1234 into R0, and the other from $1235 into R1

Nice! This is a great example of what can be trivially done with a half-step-up assembler (or easy enough for an assembler macro too in this specific case). LDWn can simply spit out four lines of assembly: LDA $1234, STA $0, LDA $1235, STA $1. 10 bytes of instructions all hiding behind LDWn, and no sign of A being clobbered.

I must say, after two night sleeping on the idea of no A register and I’m having trouble letting go.

I think perhaps we need a little benchmark to see this idea in action. Some algorithm like a 48-bit hash function that clearly doesn’t fit into an 8-bit A register, not even using X and Y to hold intermediate values. Something that takes 8-10 lines of these LDRn, STwn, ADWn pseudo mnemonics, but then translates into 30-40 lines of much-harder-to-understand 6502 assembler.

65LUN02 · Post by **65LUN02** » Sat Aug 27, 2022 1:42 am

Let's try bubble sort as a nice simple benchmark. Sufficient complex that it takes two pages of code, and in this case, I'm using 24-bit values, too big for even SWEET16 but not too big. The following traditional 6502 asm isn't optimized for size or speed, but it works:

Code: Select all

; 10x 24-bit values are loaded into $400-$41D
; (conveniently visible on one line on my Apple2 emulator screen)
outer_loop:
    lda #00             ; Is the data sorted (variable $FF = #0 if sorted and #1 if not)
    sta $FF
    ldx #0              ; Loop forward so that items move from left to right
loop_X:
    lda $400,X
    cmp $403,X          ; N[X][0] ?= N[X+1][0]
    beq    +byte2       ; IF (X+1 == X) THEN check next byte
    bcs    +swap        ; IF (X+1 < X) THEN swap
    clv
    bvc +next
byte2:
    lda $401,X
    cmp $404,X
    beq    +byte3       ; IF (X+1 == X) THEN check the next byte
    bcs    +swap        ; IF (X+1 < X) THEN swap
    clv
    bvc +next
byte3:
    lda $402,X
    cmp $405,X
    beq    +next        ; IF (X+1 == X) THEN no swap
    bcc    +next        ; IF (X+1 < X) THEN next loop
swap:
    lda $400,X            ; $400/1/2,X -> $00/1/2
    sta $00
    lda $401,X
    sta $01
    lda $402,X
    sta $02
    lda $403,X            ; $403/4/5,X -> $400/1/2,X
    sta $400,X
    lda $404,X
    sta $401,X
    lda $405,X
    sta $402,X
    lda $00               ; $00/1/2 -> $403/4/5,X
    sta $403,X
    lda $01
    sta $404,X
    lda $02
    sta $405,X
    lda #$01              ; No, the data isn't sorted
    sta $FF
next:
    inx
    inx
    inx
    cpx #27               ; Loop N-1 items (N=10, each items is 3 bytes), so 9x3 = 27
    bne -loop_X
    lda $FF
    bne -outer_loop
    rts

Next comes a version that uses R/W/T0-255 registers, the pseudo-mnemonics from earlier in this thread, and no explicit use of the A register, treating it like it doesn't exist:

Code: Select all

outer_loop:
    ld_r255 #0             ; Is the data sorted (variable $FF = #0 if sorted and #1 if not)
    ldx #0                 ; Loop forward so that items move from left to right
loop_X:
    ld_t0 $400,X
    ld_t3 $403,X
    cmpt0 t3               ; Compare all three bytes in one mnemonic
    bcc +next              ; IF (X+1 >= X) THEN next loop
swap:
    ld_t0 $400,X
    ld_t3 $403,X
    st_t0 $403,X
    st_t3 $400,X
    ld_r255 #1             ; No, the data isn't sorted
  jsr wait_for_key
next:
    inx
    inx
    inx
    cpx #27                ; Loop N-1 items (N=10, each items is 3 bytes), so 9x3 = 27
    bne -loop_X
    cmpr255 #0
    bne -outer_loop
    rts

The new version is much much shorter in terms of line of code, and with it a bit more understandable, but like every higher-level language, the tradeoff is that it translates into more opcodes. For example, the four lines: ld_t0 $400,X; ld_t3 $403,X; st_t0 $403,X; st_t3 $400,X each expand to six line of assembly:

Code: Select all

00f02a                   ; ld_t0 $0400,X
00f02a bd 00 04           lda $0400,X
00f02d 85 00              sta $00
00f02f bd 01 04           lda $0401,X
00f032 85 01              sta $01
00f034 bd 02 04           lda $0402,X
00f037 85 02              sta $02
00f039                   ; ld_t3 $0403,X
00f039 bd 03 04           lda $0403,X
00f03c 85 03              sta $03
00f03e bd 04 04           lda $0404,X
00f041 85 04              sta $04
00f043 bd 05 04           lda $0405,X
00f046 85 05              sta $05
00f048                   ; st_t0 $0403,X
00f048 a5 00              lda $00
00f04a 9d 03 04           sta $0403,X
00f04d a5 01              lda $01
00f04f 9d 04 04           sta $0404,X
00f052 a5 02              lda $02
00f054 9d 05 04           sta $0405,X
00f057                   ; st_t3 $0400,X
00f057 a5 03              lda $03
00f059 9d 00 04           sta $0400,X
00f05c a5 04              lda $04
00f05e 9d 01 04           sta $0401,X
00f061 a5 05              lda $05
00f063 9d 02 04           sta $0402,X

That is 24 lines of lda and sta vs. just 18 from the traditional version, as I'm keeping with the idea of loading and storing registers instead of copying memory to memory. That could be optimized either by hand or with a little more code in the assembler.

More importantly, ANDWn and OR_Wn will similarly expand in straightforward manner. ADCWn and SBCTn will have to be done in the right order to ensure the carry bit is properly carried. ROL and ROR will expand a bit more to properly roll the carry all the way around. I've yet to code up any of those, but they look easy.

I did code up LD and ST and the listing above is from my assembler's output. However, I can't finish this little benchmark without tackling the one non-obvious instruction, CMP. CMPRn is trivial. The question is whether its worth implementing CMPWn and CMPTn analogous to CMP, setting the N, Z flags properly across the 16-bit or 24-bit values. Doing that is going to expand into a lot more than two or three opcodes. Doing that is either going to require branches or some EOR logic. And doing that is going to require at least one byte of storage to hold the intermediate flags.

The RISC way to do this to forgo CMP and instead put the register comparison in the Bzz branch instruction. All we need for bubble sort is BCC (as the BEQ in the first version is there to iterate through the three byte comparisons and thus will only appear in the generated opcodes). BCCT0 T3 +skip looks very non 6502. So does LD_R255 #0, but that is a far smaller step than dropping CMP.

Anyone else have a better idea?

BigEd · Post by **BigEd** » Sat Aug 27, 2022 6:19 am

A multi-byte CMP is rather like a multi-byte SBC, except it doesn't save the result. So if you proceed from LSB to MSB, I think you find the flags come out as you expect. (Edit: oops, no, see below!)

(It's tempting to want to CMP from the MSB first, in order to shortcut the result. Maybe it's even easy! I can't quite think it through at present. I think it means exiting the sequence with BNE, because if the two bytes are equal, you need to compare the next two.)

Proxy · Post by **Proxy** » Sat Aug 27, 2022 11:44 am

I also recently finished the Instruction set for my own flavor of 6502 RISC.
But i didn't want to use something like CustomASM to create the instructions from scratch using custom formats and such, i wanted this to be import-able into existing projects using a common assembler... and i choose ca65 (because i'm already using it for everything else).
so after reading through ca65's documentation about macros, i found out that they are somehow both more powerful and limit than i though. either way i did get it working and i also made a small bubble sort algorithm with it (though mine is 8-bit instead of 24-bit, maybe i'll do a rewrite)

the Instruction format and list is here:

Code: Select all

<<<Registers:
R0-R31    - 8-bit General Purpose Registers
W0-W15    - 16-bit General Purpose Registers

the W Registers are made out of pairs of R Registers, R0 and R1 form W0, R2 and R3 form W1, etc

<<<Instruction format:
Rs        - Source Register
Rd        - Destination Register
imm8      - 8-bit Immediate value
imm16     - 16-bit Immediate value

<<<Instruction List:
Load Immediate            - LDI Rd, #imm8/16
Load Memory               - LDM Rd, imm16
Load Register             - LDR Rd, Rs (Rs can only be a W Register)
Store Memory              - STM Rd, imm16
Store Register            - STR Rd, Rs (Rd can only be a W Register)

Push Register             - PSH Rs
Pull Register             - PLL Rd

Add with Carry            - ADR Rd, Rs1, Rs0
Subtract with Carry       - SBR Rd, Rs1, Rs0
Logic AND                 - ANR Rd, Rs1, Rs0
Logic OR                  - ORR Rd, Rs1, Rs0
Logic XOR                 - XRR Rd, Rs1, Rs0
Compare                   - CPR Rs1, Rs0
Rotate Right              - RRR Rd, Rs
Rotate Left               - RLR Rd, Rs

Add with Carry Immediate  - ADI Rd, Rs, #imm8/16
Sub. with Carry Immediate - SBI Rd, Rs, #imm8/16
Logic AND Immediate       - ANI Rd, Rs, #imm8/16
Logic OR Immediate        - ORI Rd, Rs, #imm8/16
Logic XOR Immediate       - XRI Rd, Rs, #imm8/16
Compare Immediate         - CPI Rs, #imm8/16

Something special about LDM, LDR, STM, and STR is that you can use either Index Registers with them. even though LDR and STR are basically just Indirect Addressing, the X Register will be added after the Indirect part, like with (zp),Y. effectively creating the (zp),X Addressing Mode.
Also, each instruction can either be 8 or 16-bit depending on the Registers used, but do note that mixing of Registers is not allowed.
So ADR R6, R9, R10 is valid, but ADR W2, R9, R5 is not. if i allow for register mixing there would be way too many possible register type arrangements for each instruction. and all of those would need their own chunk of code in the macro. and that's too insane for me!

going back to the Compare Instruction thing both of you mentioned, i originally thought about getting rid of Compare completely and just copying what RISC-V does with it's Branch instructions. but ultimately i just decided to go back and make the 16-bit Compare work exactly like the 8-bit one, where it sets the N, C, and Z flags like you'd expect from a single CMP Instruction.
and i did it kinda stupidly by using branches and hardwired LDA # CMP # sequences to set the flags correctly:

Code: Select all

LDA src0 - (REG_OFFS - 1)
CMP src1 - (REG_OFFS - 1)
BCC @islower		; src0 < src1
BNE @ishigher		; src0 > src1
LDA src0 - REG_OFFS
CMP src1 - REG_OFFS
BCC @islower		; src0 < src1
BEQ @issame			; src0 = src1
@ishigher:			; src0 > src1
	LDA #42
	CMP #21
BRA @exit
@issame:
	LDA #42
	CMP #42
BRA @exit
@islower:
	LDA #21
	CMP #42
@exit:

anyways, here is my small Bubble Sort routine. a_input is the Array containing the Bytes, and a_count is the amount of Elements in the Array

Code: Select all

_main:
    LDI W15, #a_input           ; Get the Address of the Array
    
    @outer_loop:
        LDY #0
        LDI R2, #0
        
        @loop:
            LDR R0, W15,Y
            INY
            LDR R1, W15,Y       ; Get 2 consecutive bytes from the Input Array
            
            CPR R0, R1          ; Compare them to eachother
            
            BCS @noswap         ; Branch if R0 >= R1
            STR W15,Y, R0
            DEY
            STR W15,Y, R1       ; Otherwise swap the Values in the Array
            INY
            LDI R2, #1          ; And set R2 to >0 to show that the array wasn't sorted yet
            @noswap:
            
            CPY #a_count - 1    ; Repeat this loop until the last Element has been reached
        BNE @loop
        
        CPI R2, #0              ; If the previous loop didn't swap any values, exit the function
    BNE @outer_loop             ; As that means the Array is sorted
    @exit:
    
_end:

I want to try and port this one to the 65816 next, since it should shrink down the 16-bit operations to basically the same exact size as the 8-bit ones. maybe i can use the smart mode in ca65 to only insert the REP/SEP instructions for the accumulator when they are actually needed instead of forcing them in each macro.
anyways, i attached the include file for ca65 below. the file extension doesn't matter but it's common for ca65 assembly includes to have the .inc extension (but i had to rename it to be able to upload it) there might also still be bugs...

BillG · Post by **BillG** » Sat Aug 27, 2022 11:45 am

BigEd wrote:

A multi-byte CMP is rather like a multi-byte SBC, except it doesn't save the result. So if you proceed from LSB to MSB, I think you find the flags come out as you expect.

Only if you do not care about equality.

The ATMEL AVR is very good about this. http://ww1.microchip.com/downloads/en/d ... manual.pdf

The CP (compare, pg 77) instruction is what you would expect. It is used to compare the least-significant pair of bytes.

The CPC (compare with carry, pg 79) instruction is the magic. The Z flag is cleared if the bytes are different; it is not affected when the bytes are the same allowing an equal status to ripple through as successive bytes are processed.

BigEd wrote:

(It's tempting to want to CMP from the MSB first, in order to shortcut the result. Maybe it's even easy! I can't quite think it through at present. I think it means exiting the sequence with BNE, because if the two bytes are equal, you need to compare the next two.)

For numbers larger than two bytes in size, it is better to compare from high byte to low. Only if the numbers are nearly equal do all of the bytes need to be processed.

BigEd · Post by **BigEd** » Sat Aug 27, 2022 12:49 pm

Ah yes, I stand corrected - if you do it LSB first with a multi-byte subtract, you need a multi-byte zero test at the end, because the sign of the MSB of the result is right, but it could be zero without the whole subtraction coming out zero. That's painful because the multi-byte zero test needs information to be preserved across the whole process.

Sean · Post by **Sean** » Mon Aug 29, 2022 1:41 am

Proxy wrote:

I also recently finished the Instruction set for my own flavor of 6502 RISC.

Interesting. The macros allow a RISC-like syntax to be used for assembly, but this isn't a VM or a hardware approach. With the price of memory these days the extra machine code generated probably isn't going to be a problem. Have you done any tests to compare your bubble sort using the RISC macros vs. straight assembly, in terms of speed or code density?

65LUN02 · Post by **65LUN02** » Tue Aug 30, 2022 7:27 pm

BigEd wrote:

you need a multi-byte zero test at the end, because the sign of the MSB of the result is right, but it could be zero without the whole subtraction coming out zero. That's painful because the multi-byte zero test needs information to be preserved across the whole process.

Exactly. Z = Z1 & Z2 & … Zn for an n-byte long operation whereas C gets carried from byte to byte and N only matters for the MSB. V too only cares about the MSB but no one seems to care about V.

The simplest fix is probably to just have a pseudo-op CMPR/W/T that only promises C and N, and then a separate CPZR/W/T pseudo-opcode that only promises to set Z.

TheBrewThatIsTrue · Post by **TheBrewThatIsTrue** » Tue Aug 30, 2022 10:42 pm

I think it's a false dichotomy, recall that this was a time when hand assembly was still commonplace. Regardless of what the tools, or even the documentation called something, in the end what mattered to competent low-level programmers were the capabilities, affordances, and tradeoffs on offer from the hardware. Only neophytes would have been overly influenced by syntax and naming conventions, and one doesn't remain a neophyte for long.

65LUN02 · Post by **65LUN02** » Sat Sep 24, 2022 6:49 pm

TheBrewThatIsTrue wrote:

I think it's a false dichotomy, recall that this was a time when hand assembly was still commonplace. Regardless of what the tools, or even the documentation called something, in the end what mattered to competent low-level programmers were the capabilities, affordances, and tradeoffs on offer from the hardware. Only neophytes would have been overly influenced by syntax and naming conventions, and one doesn't remain a neophyte for long.

Maybe. A few month and thousands of lines of assembly code into my Apple II4 project, I’ve managed to remove all but one global from the zero page (the pointer to the current line of text in the screen buffer), otherwise using zp solely as storage for variables within a subroutine.

I’m using and enjoying the %Rnnn syntax for that. Not in the form LDA %Rn but instead assigning variable names at the start of each subroutine (as one would do in the 1970s version of C), or just before a loop if I need both X and Y within the loop.

With 256 “registers”, I have no conflicts. To avoid that, and to avoid having to use the stack at all, I have the lowest level subroutines use %R253 downward (that one global is %R254-255), the mid-level subroutines use %R128 upward, and the top level code uses %R0 upward.

As a bonus, using decimal numbering, using the %Syntax, and mostly avoiding the low numbers makes this all pop out as zero page.

WillisBlackburn · Post by **WillisBlackburn** » Mon Sep 26, 2022 10:55 pm

I tried a few strategies along these lines. My problem was always in figuring out a useful convention for preserving, or not preserving, the "registers" within subroutines.

Let's say there are 32 general-purpose registers, R0-R31, in zero page. Are subroutines allowed to clobber them? If subroutines can just use any register they like, then any function that has to call subroutines doesn't get much benefit from having all the registers available.

Another strategy could be to divide the registers into classes: maybe R0-R7 can be clobbered at any time, and this happens to make them good for passing arguments to and from functions, so we'll use them for that too. And then we'll say that R8-R31 are preserved: any subroutine that wants to use one has to save the current value first and restore it before returning. That would give every function 24 bytes that they use without having to worry about them being clobbered by some called subroutine.

It seems like it would work, but what turns me off is the performance cost of pushing and popping the preserved registers. On the plain vanilla 6502, it's only possible to push and pop A, so if you want to (a) accept a couple of arguments in the A and X registers and (b) assign them to register R8 to use during the function, you have to first save A somewhere (e.g., R0) so you can then use A to preserve R8, and then finally move the value from R0 to R8. Same thing on the return side; needing to use A to restore register R8 makes it hard to also use A to return a value from the function.

I guess you could go all-in and only pass/return values through the R0-R7 registers, but that means another performance hit, since it's much faster to just pass and return values through real registers than through zero page.

I think if I was working with such a system, when it came time to use one of the preserved registers, I'd probably think, I'll just create a new zero-page variable for this, then I don't have to push or pop anything! At the end of the day, if the total size of the local variables used by all the subroutines in your program is less than the number of available zero page bytes, then you can just put each of them in its own zero page location and call it a day.

65LUN02 · Post by **65LUN02** » Mon Sep 26, 2022 11:03 pm

Thanks @WillisBlackburn. Yes, I considered passing args through zp, but came to the same conclusion, that 256 locations wasn't enough if those precious addresses should also be used like registers, plus a few more for the most common of globals.

Over on viewtopic.php?f=2&t=7343 I shared my solution for passing parameters. I ended up with something akin to a manual data stack. I very much like the way that turned out in practice, mostly because it avoids all use of pushing and popping and because the parameters end up in memory addresses computable at assembly time, which means upon a breakpoint I can in seconds look at the parameters without having to unpack a dynamic stack.

And with that solution, I'm left with zp address locations to act like 253 registers (plus one 2-byte global pointer).

BruceRMcF · Post by **BruceRMcF** » Wed Sep 28, 2022 4:18 pm

BigEd wrote:

Ah yes, I stand corrected - if you do it LSB first with a multi-byte subtract, you need a multi-byte zero test at the end, because the sign of the MSB of the result is right, but it could be zero without the whole subtraction coming out zero. That's painful because the multi-byte zero test needs information to be preserved across the whole process.

Quite ... and the multi-byte equality test is faster on its own, since it can shortcut out on the first byte that fails, so rather than having 0= being part of the CMP, it's better to have it be two way < >= with 0= as a distinct operation.

Though I guess you could just count the number of times it fails the byte equality test and if the count is 0, it was an equality ... kind of like "STZ ZNE : ... : - ... : BEQ - : INC ZNE : BRA -"

Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?

Re: Zero page as registers or globals?