BigEd wrote:
(Hmm, is that perhaps a private repo?)
It was, but that is now fixed. And the bug fixes through the three levels of assembler, CPU emulator, Apple2 emulator are committed too.
Having bugs possible in all three layers makes this an interesting project to debug. That was the driving factor toward writing pomme in go, as iz6502 and izapple2 were written in go.
Meanwhile... the functionality of Apple2four.pom now matches the functionality of Apple2four.asm. The only difference so far is that the .pom version uses a few 16-bit and 24-bit registers for accessing the screen. The first few learnings:
- Data tables of addresses are a lot simpler. What was:
Code:
TextScreenBaseL
DC.B $00, $80, $00, $80, $00, $80, $00, $80
DC.B $28, $A8, $28, $A8, $28, $A8, $28, $A8
DC.B $50, $D0, $50, $D0, $50, $D0, $50, $D0
TextScreenBaseH
DC.B $04, $04, $05, $05, $06, $06, $07, $07
DC.B $04, $04, $05, $05, $06, $06, $07, $07
DC.B $04, $04, $05, $05, $06, $06, $07, $07
becomes:
Code:
data TextScreenBase @$FF8000 word {
$0400, $0480, $0500, $0580, $0600, $0680, $0700, $0780
$0428, $04A8, $0528, $05A8, $0628, $06A8, $0728, $07A8
$0450, $04D0, $0550, $05D0, $0650, $06D0, $0750, $07D0
}
- Pomme doesn't have the < and > low-byte/high-byte operators, since a whole address can be loaded in one instruction. For example:
Code:
lda #<CMD_Clear ; $05/$06/$07 = COMMAND string base address
sta $05
lda #>CMD_Clear
sta $06
lda #$FF ; hard-coded to $FF as dasm doesn't have >> to grab the 3rd byte
sta $07
becomes:
Code:
lda.t #CMD_Clear ; $05/$06/$07 = COMMAND string base address
sta.t $05
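To check the two table layouts in the first item against each other, here is a minimal Go sketch. It assumes the standard Apple II text page 1 layout, which matches every entry in the word table above. The names textScreenBase and splitTables are made up for illustration and are not part of pomme or izapple2.

```go
package main

import "fmt"

// textScreenBase returns the text page 1 base address for a row (0-23).
// Rows are interleaved: 8 rows step by $80, then the next group of 8
// starts $28 further in.
func textScreenBase(row int) uint16 {
	return 0x0400 + uint16(row%8)*0x80 + uint16(row/8)*0x28
}

// splitTables builds the separate low-byte and high-byte tables that the
// original .asm version needs (TextScreenBaseL / TextScreenBaseH).
func splitTables() (lo, hi [24]byte) {
	for row := 0; row < 24; row++ {
		addr := textScreenBase(row)
		lo[row] = byte(addr & 0xFF)
		hi[row] = byte(addr >> 8)
	}
	return lo, hi
}

func main() {
	lo, hi := splitTables()
	for row := 0; row < 24; row++ {
		fmt.Printf("row %2d: $%04X (lo $%02X, hi $%02X)\n",
			row, textScreenBase(row), lo[row], hi[row])
	}
}
```

The word column is what the single .pom table holds; the lo and hi columns are what the two DC.B tables hold.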
The original is 480 lines of 6502 assembly and compiles into 1,355 bytes of code. All the /**/s and {}s grow the Pomme version to 490 lines, but it compiles to only 1,318 bytes of code. So despite the prefix codes, the handful of lines of code loading and storing 16-bit and 24-bit addresses saved 37 bytes.
The two tiny optimizations I made are to clear the screen two bytes at a time and to scroll the screen two bytes at a time. Both make the code longer, as I have to iny twice per loop, but in terms of cycles, those loops only have to run half as many times: 20 times per row instead of 40.
Finally, the only bug introduced in these changes was in pulling the address from that TextScreenBase data table. In the original, X held the row number, and TextScreenBase,X pulled the correct address from the byte-sized tables. Now that each entry is a word, X needs to be twice the row number. So just as writing two bytes at a time in ClearScreen requires two inys per iteration, the X index in the ClearScreen loop needs two inxs too, and the cpx #24 has to change to cpx #48 to match.
Which then makes me wonder... given a 16-bit 6502, does it make sense to have a prefix code that makes ,X and ,Y double the value in X or Y before using it as an index? That is simple to implement in hardware, as it's just bits 6:0 of the register shifted up one place with a zero in the lsb. But then similarly, for 24-bit operations, a prefix code to pre-compute X*3 and Y*3 would be just as helpful, and computing X*3 = (X<<1)+X is an extra cycle. That seems a slippery slope to head down, so perhaps just xsl and ysl, as they are more orthogonal to asl and lsr and more generally useful.
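For reference, the two scalings being weighed look like this in Go. This is only an arithmetic sketch: scale2 and scale3 are illustrative names, and widening the result to 16 bits is my assumption here, since how a real prefix would treat the bit shifted out of an 8-bit register is an open design choice.

```go
package main

import "fmt"

// scale2 is the doubled index for word tables: a pure shift, no adder.
func scale2(x uint8) uint16 {
	return uint16(x) << 1
}

// scale3 is the tripled index for 24-bit tables: the shift plus one add.
func scale3(x uint8) uint16 {
	return (uint16(x) << 1) + uint16(x)
}

func main() {
	for _, row := range []uint8{0, 1, 23} {
		fmt.Printf("row %2d: x2=%d x3=%d\n", row, scale2(row), scale3(row))
	}
}
```

The x2 case is wiring only, while x3 needs an adder pass, which is where the extra cycle comes from.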