Let's try bubble sort as a nice simple benchmark. Sufficient complex that it takes two pages of code, and in this case, I'm using 24-bit values, too big for even SWEET16 but not too big. The following traditional 6502 asm isn't optimized for size or speed, but it works:
Code:
; 10x 24-bit values are loaded into $400-$41D
; (conveniently visible on one line on my Apple2 emulator screen)
outer_loop:
lda #00 ; Is the data sorted (variable $FF = #0 if sorted and #1 if not)
sta $FF
ldx #0 ; Loop forward so that items move from left to right
loop_X:
lda $400,X
cmp $403,X ; N[X][0] ?= N[X+1][0]
beq +byte2 ; IF (X+1 == X) THEN check next byte
bcs +swap ; IF (X+1 < X) THEN swap
clv
bvc +next
byte2:
lda $401,X
cmp $404,X
beq +byte3 ; IF (X+1 == X) THEN check the next byte
bcs +swap ; IF (X+1 < X) THEN swap
clv
bvc +next
byte3:
lda $402,X
cmp $405,X
beq +next ; IF (X+1 == X) THEN no swap
bcc +next ; IF (X+1 < X) THEN next loop
swap:
lda $400,X ; $400/1/2,X -> $00/1/2
sta $00
lda $401,X
sta $01
lda $402,X
sta $02
lda $403,X ; $403/4/5,X -> $400/1/2,X
sta $400,X
lda $404,X
sta $401,X
lda $405,X
sta $402,X
lda $00 ; $00/1/2 -> $403/4/5,X
sta $403,X
lda $01
sta $404,X
lda $02
sta $405,X
lda #$01 ; No, the data isn't sorted
sta $FF
next:
inx
inx
inx
cpx #27 ; Loop N-1 items (N=10, each items is 3 bytes), so 9x3 = 27
bne -loop_X
lda $FF
bne -outer_loop
rts
Next comes a version that uses R/W/T0-255 registers, the pseudo-mnemonics from earlier in this thread, and no explicit use of the A register, treating it like it doesn't exist:
Code:
outer_loop:
ld_r255 #0 ; Is the data sorted (variable $FF = #0 if sorted and #1 if not)
ldx #0 ; Loop forward so that items move from left to right
loop_X:
ld_t0 $400,X
ld_t3 $403,X
cmpt0 t3 ; Compare all three bytes in one mnemonic
bcc +next ; IF (X+1 >= X) THEN next loop
swap:
ld_t0 $400,X
ld_t3 $403,X
st_t0 $403,X
st_t3 $400,X
ld_r255 #1 ; No, the data isn't sorted
jsr wait_for_key
next:
inx
inx
inx
cpx #27 ; Loop N-1 items (N=10, each items is 3 bytes), so 9x3 = 27
bne -loop_X
cmpr255 #0
bne -outer_loop
rts
The new version is much much shorter in terms of line of code, and with it a bit more understandable, but like every higher-level language, the tradeoff is that it translates into more opcodes. For example, the four lines:
ld_t0 $400,X; ld_t3 $403,X; st_t0 $403,X; st_t3 $400,X each expand to six line of assembly:
Code:
00f02a ; ld_t0 $0400,X
00f02a bd 00 04 lda $0400,X
00f02d 85 00 sta $00
00f02f bd 01 04 lda $0401,X
00f032 85 01 sta $01
00f034 bd 02 04 lda $0402,X
00f037 85 02 sta $02
00f039 ; ld_t3 $0403,X
00f039 bd 03 04 lda $0403,X
00f03c 85 03 sta $03
00f03e bd 04 04 lda $0404,X
00f041 85 04 sta $04
00f043 bd 05 04 lda $0405,X
00f046 85 05 sta $05
00f048 ; st_t0 $0403,X
00f048 a5 00 lda $00
00f04a 9d 03 04 sta $0403,X
00f04d a5 01 lda $01
00f04f 9d 04 04 sta $0404,X
00f052 a5 02 lda $02
00f054 9d 05 04 sta $0405,X
00f057 ; st_t3 $0400,X
00f057 a5 03 lda $03
00f059 9d 00 04 sta $0400,X
00f05c a5 04 lda $04
00f05e 9d 01 04 sta $0401,X
00f061 a5 05 lda $05
00f063 9d 02 04 sta $0402,X
That is 24 lines of
lda and
sta vs. just 18 from the traditional version, as I'm keeping with the idea of loading and storing registers instead of copying memory to memory. That could be optimized either by hand or with a little more code in the assembler.
More importantly,
ANDWn and
OR_Wn will similarly expand in straightforward manner.
ADCWn and
SBCTn will have to be done in the right order to ensure the carry bit is properly carried.
ROL and
ROR will expand a bit more to properly roll the carry all the way around. I've yet to code up any of those, but they look easy.
I did code up
LD and
ST and the listing above is from my assembler's output. However, I can't finish this little benchmark without tackling the one non-obvious instruction,
CMP.
CMPRn is trivial. The question is whether its worth implementing
CMPWn and
CMPTn analogous to CMP, setting the N, Z flags properly across the 16-bit or 24-bit values. Doing that is going to expand into a lot more than two or three opcodes. Doing that is either going to require branches or some EOR logic. And doing that is going to require at least one byte of storage to hold the intermediate flags.
The RISC way to do this to forgo CMP and instead put the register comparison in the Bzz branch instruction. All we need for bubble sort is BCC (as the BEQ in the first version is there to iterate through the three byte comparisons and thus will only appear in the generated opcodes).
BCCT0 T3 +skip looks very non 6502. So does
LD_R255 #0, but that is a far smaller step than dropping CMP.
Anyone else have a better idea?