The 65M202 road map that could have been
Re: The 65M202 road map that could have been
Sean wrote:
65LUN02 wrote:
Where do you get 10K transistors for the ARM1?
Thus the 65C2424 would be roughly 15,000 transistors while an NMOS 652525 back in the late 1970s would be only 6,750 or so transistors.
Re: The 65M202 road map that could have been
Sean wrote:
I think I read somewhere that there were later enhancements to allow separate 64K address space for data, 64K address space for program, per user, but I may be mistaken.
https://www.computercollection.net/pdp11/index.htm.
Re: The 65M202 road map that could have been
BigEd wrote:
What I've found in previous adventures is that an emulator and an assembler are quite handy at this stage. Failing the emulator, if one uses hardware, then a monitor.
I'm much more of a C than assembler coder, and while playing with https://github.com/lunarmobiscuit/izapple2/ and its 6502 emulator https://github.com/lunarmobiscuit/iz6502/ to create an emulated "Apple ][4", it seemed easier to write an assembler/compiler that mixes C-like blocks, ifs, loops, and variables with assembly mnemonics than to stick solely with assembly or jump all the way to C.
The result of that is https://github.com/lunarmobiscuit/pomme/. So far it's just an assembler, but with {} blocks for subroutines and data. https://github.com/lunarmobiscuit/pomme ... e/test.pom will offend both assembly coders and C coders in its style, but I like where it is headed.
Meanwhile, in terms of the 65C24T8, it is a working assembler, including every address mode, address size, register size, and every opcode I've described prior in this thread along with a few that seemed likely to be needed once I start playing with 16-bit and 24-bit registers: shift left/right 4, shift left/right 8, A+=x, A+=y, A=X+Y.
The number of mnemonics doesn't rise by much, and the opcode space still has quite a few holes, but the total number of possible instructions jumps past 600. See https://github.com/lunarmobiscuit/pomme ... pcodes.pom for every possibility and https://github.com/lunarmobiscuit/pomme ... pcodes.lst for the generated machine code.
The iz6502 emulator currently implements all of those up to SWS (set stack width). I'll add the others soon.
Re: The 65M202 road map that could have been
(Hmm, is that perhaps a private repo?)
Re: The 65M202 road map that could have been
BigEd wrote:
(Hmm, is that perhaps a private repo?)
Having bugs possible in all three layers makes this an interesting project to debug. That was the driving factor toward writing pomme in Go, as iz6502 and izapple2 were written in Go.
Meanwhile... the functionality of Apple2four.pom now matches the functionality of Apple2four.asm. The only difference so far is that the .pom version uses a few 16-bit and 24-bit registers for accessing the screen. The first few learnings:
- Data tables of addresses are a lot simpler. What was:
Code: Select all
TextScreenBaseL
DC.B $00, $80, $00, $80, $00, $80, $00, $80
DC.B $28, $A8, $28, $A8, $28, $A8, $28, $A8
DC.B $50, $D0, $50, $D0, $50, $D0, $50, $D0
TextScreenBaseH
DC.B $04, $04, $05, $05, $06, $06, $07, $07
DC.B $04, $04, $05, $05, $06, $06, $07, $07
DC.B $04, $04, $05, $05, $06, $06, $07, $07
becomes:
Code: Select all
data TextScreenBase @$FF8000 word {
$0400, $0480, $0500, $0580, $0600, $0680, $0700, $0780
$0428, $04A8, $0528, $05A8, $0628, $06A8, $0728, $07A8
$0450, $04D0, $0550, $05D0, $0650, $06D0, $0750, $07D0
}
- Loading a 24-bit address is simpler too. What was:
Code: Select all
lda #<CMD_Clear ; $05/$06/$07 = COMMAND string base address
sta $05
lda #>CMD_Clear
sta $06
lda #$FF ; hard-coded to $FF as dasm doesn't have >> to grab the 3rd byte
sta $07
becomes:
Code: Select all
lda.t #CMD_Clear ; $05/$06/$07 = COMMAND string base address
sta.t $05
The two tiny optimizations I made were to clear the screen two bytes at a time and to scroll the screen two bytes at a time. Both make the code longer, as I have to iny twice per loop, but in terms of cycles those loops only have to run half as many times: 20 times per row instead of 40.
Finally, the only bug introduced in these changes was in pulling the address from that TextScreenBase data table. In the original, X held the row number, and TextScreenBase,X pulled the correct address from the byte-sized table. Now X needs to be twice the row number. So just as writing two bytes at a time in ClearScreen requires two iny per iteration, the X index in the ClearScreen loop needs two inx as well, and the cpx #24 had to change to cpx #48 to match.
Which then makes me wonder... given a 16-bit 6502, does it then make sense to have a prefix code to make ,X and ,Y double the value in X or Y prior to using it as an index? That is simple to implement in hardware, as it's just bits 6:1 of the register with a zero in the lsb. But then similarly, for 24-bit operations, a prefix code to pre-compute X*3 and Y*3 would be just as helpful, and computing X*3 = (X<<1)+X is an extra cycle. That seems a slippery slope to head down, so perhaps just xsl and ysl, which are more orthogonal to asl and lsr and more generally useful.
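To make the table-stride issue concrete, here is a rough Python model (not pomme or emulator code) of looking up a row address in a word-sized table. The `row_base` formula is the usual Apple II text page 1 layout and reproduces the TextScreenBase values above. The names `scaled_offset`, `lookup_plain`, and `lookup_scaled` are hypothetical, and the scaling prefix is only the idea being floated here, not something the emulator is known to implement.

```python
ROWS = 24

def row_base(row):
    """Base address of a text row on Apple II text page 1."""
    return 0x0400 + (row % 8) * 0x80 + (row // 8) * 0x28

# Word table: one little-endian 16-bit entry per row (48 bytes in total).
TABLE = bytearray()
for row in range(ROWS):
    TABLE += row_base(row).to_bytes(2, "little")

def scaled_offset(index, width):
    """Offset a hypothetical scaling prefix would compute for an entry width."""
    if width == 1:
        return index
    if width == 2:
        return index << 1             # a shift: no adder needed
    if width == 3:
        return (index << 1) + index   # shift plus add, hence the extra cycle
    raise ValueError("width must be 1, 2, or 3 bytes")

def lookup_plain(x):
    """Plain ,X indexing: the program must keep X at 2 * row itself."""
    return int.from_bytes(TABLE[x:x + 2], "little")

def lookup_scaled(row):
    """With the hypothetical prefix, X can stay equal to the row number."""
    off = scaled_offset(row, 2)
    return int.from_bytes(TABLE[off:off + 2], "little")
```

With plain indexing the loop bound over this table is 48 bytes, which is where cpx #48 comes from; with a scaling prefix the bound would stay at 24.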
Re: The 65M202 road map that could have been
65LUN02 wrote:
Which then makes me wonder... given a 16-bit 6502, does it then make sense to have a prefix code to make ,X and ,Y double the value in X or Y prior to using it as an index? That is simple to implement in hardware, as it's just bits 6:1 of the register with a zero in the lsb. But then similarly, for 24-bit operations, a prefix code to pre-compute X*3 and Y*3 would be just as helpful, and computing X*3 = (X<<1)+X is an extra cycle. That seems a slippery slope to head down, so perhaps just xsl and ysl, which are more orthogonal to asl and lsr and more generally useful.
Re: The 65M202 road map that could have been
Sean wrote:
65LUN02 wrote:
With respect to indexed use: Do you do byte reads from 16-bit memory or just word reads, which might require the LSB to distinguish between the first and second byte in the word? Or is that perhaps tangential to the question at hand, since the LSB could be ignored in address generation and only used for routing one byte or the other to the register?
The registers are 1, 2, or 3 bytes wide, but the state machine ends up iterating over extra cycles loading the extra bytes.
Thus lda.w $00 issues two memory reads: $00 and $01. A ends up with the little-endian value [$01][$00].
Similarly, lda.t $1234 issues three memory reads, with the value [$1236][$1235][$1234] stored in register A.
With an 8-bit bus, there is no concept of word alignment as seen on the 68000 or RISC chips. You can store to $1234 and then load 1, 2, or 3 bytes from $1233 or $1235 if you want.
I can see that being a problem if some later version of this chip moved to a wider 16-bit or 24-bit data bus, but by the 1980s CPU clock speeds were faster than memory speeds, so by then one would expect a cache between the CPU and memory. The cache could deal with unaligned requests, as it would be loading more than 2 or 3 bytes per load.
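The byte-at-a-time behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not code from iz6502. The helper names `load` and `store` are made up for the example, and wrapping the address at the end of memory is an assumption the thread does not specify.

```python
def load(mem, addr, width):
    """Read `width` bytes, one 8-bit bus read each, and assemble them little-endian."""
    value = 0
    for i in range(width):
        byte = mem[(addr + i) % len(mem)]   # assumed: addresses wrap at end of memory
        value |= byte << (8 * i)
    return value

def store(mem, addr, value, width):
    """Write `width` bytes of `value`, least significant byte first."""
    for i in range(width):
        mem[(addr + i) % len(mem)] = (value >> (8 * i)) & 0xFF

mem = bytearray(0x10000)
store(mem, 0x1234, 0x123456, 3)      # like sta.t $1234 with A = $123456
aligned = load(mem, 0x1234, 3)       # like lda.t $1234
shifted_up = load(mem, 0x1235, 2)    # like lda.w $1235, no alignment rules apply
shifted_down = load(mem, 0x1233, 3)  # like lda.t $1233
```

Because every access is just a run of single-byte reads, a load that starts one byte earlier or later than the store simply sees a shifted window of the same bytes.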
Re: The 65M202 road map that could have been
Thanks for sharing! Interesting ideas indeed.
Re: The 65M202 road map that could have been
Over on viewtopic.php?f=2&t=7304 we're talking about whether zero page would be treated more like addressable registers if the assembly mnemonics treated them like registers instead of 8-bit addressable memory. In testing out those ideas, I whipped up a bubble sort that sorts 24-bit values. Which, of course, makes me want to see for this thread what happens if that algorithm is coded up with 24-bit registers.
For reference, the traditional (non-optimized) 6502 version below is 53 lines of code (including the six labels) and 108 bytes of machine code, while the 24-bit register version is only 28 lines of code and only 60 bytes of machine code.
The traditional (non-optimized) 6502 version is:
Code: Select all
ldy #0
loop_Y:
lda #00 ; Is the data sorted (variable $FF = #0 if sorted and #1 if not)
sta $FF
ldx #0 ; Loop forward so that items move from left to right
loop_X:
lda $400,X
cmp $403,X ; N[X][0] ?= N[X+1][0]
beq +byte2 ; IF (X+1 == X) THEN check next byte
bcs +swap ; IF (X+1 < X) THEN swap
clv
bvc +next
byte2:
lda $401,X
cmp $404,X
beq +byte3 ; IF (X+1 == X) THEN check the next byte
bcs +swap ; IF (X+1 < X) THEN swap
clv
bvc +next
byte3:
lda $402,X
cmp $405,X
beq +next ; IF (X+1 == X) THEN no swap
bcc +next ; IF (X < X+1) THEN next loop
swap:
lda $400,X ; $400/1/2,X -> $00/1/2
sta $00
lda $401,X
sta $01
lda $402,X
sta $02
lda $403,X ; $403/4/5,X -> $400/1/2,X
sta $400,X
lda $404,X
sta $401,X
lda $405,X
sta $402,X
lda $00 ; $00/1/2 -> $403/4/5,X
sta $403,X
lda $01
sta $404,X
lda $02
sta $405,X
inc $FF ; No, the data isn't sorted
next:
inx
inx
inx
cpx #27 ; Loop N-1 items (N=10, each item is 3 bytes), so 9x3 = 27
bne -loop_X
lda $FF
bne -loop_Y
rts
To create the 24-bit register version, I just took the code, added a few .t's, and took out all the parts that iterate over the three individual bytes. In debugging I did notice that the traditional version is comparing big-endian values instead of little-endian values, but I didn't fix that bug.
Code: Select all
00d800 a0 00 ldy #$0
00d802 loop_y:
00d802 a9 00 lda #$0
00d804 85 ff sta $ff
00d806 a2 00 ldx #$0
00d808 loop_x:
00d808 20 30 f8 jsr $f830
00d80b 2f bd 00 04 lda.t $000400,X
00d80f 2f dd 03 04 cmp.t $000403,X
00d813 f0 1a beq +26
00d815 90 18 bcc +24
00d817 swap:
00d817 2f bd 00 04 lda.t $000400,X
00d81b 2f 85 00 sta.t $00
00d81e 2f bd 03 04 lda.t $000403,X
00d822 2f 9d 00 04 sta.t $000400,X
00d826 2f a5 00 lda.t $00
00d829 2f 9d 03 04 sta.t $000403,X
00d82d e6 ff inc $ff
00d82f next:
00d82f e8 inx
00d830 e8 inx
00d831 e8 inx
00d832 e0 1b cpx #$1b
00d834 d0 d2 bne -46
00d836 test:
00d836 c8 iny
00d837 a5 ff lda $ff
00d839 d0 c7 bne -57
00d83b done:
00d83b 60 rts
The choice of 24-bit values certainly helped in that comparison. With 16-bit values the traditional version would have been smaller, while the wide-register version would stay at the same 60 bytes. Values of 32 bits or wider would require comparing in chunks on both CPUs, but in 16-bit or 24-bit chunks on the 652402.
I chose bubble sort as it's a whole page of code in length, and as the memory swap can't be done with just A, X, and Y. That swap requires either using zp or the stack. It thus shows off how much more efficient a 6502 can be if such operations can be done 2 or 3 bytes at a time, even if underneath the machine code the CPU is actually just iterating over bytes on an 8-bit data bus.
My assembler doesn't count total cycles, but the wider register version is certainly going to take fewer cycles: if nothing else, it eliminates all the branches needed to compare one byte at a time instead of three bytes in one shot.
I'd love to see this same unoptimized algorithm in 65C816 assembly too for comparison. Can someone who knows it post it?
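As a cross-check on what the two listings are meant to do, here is a small Python model of the same unoptimized bubble sort over 3-byte values. It is a sketch rather than a port: `load24`, `store24`, `bubble_sort24`, and `bytewise_less` are names invented for this example, and it assumes, as described earlier in the thread, that a .t load treats the lowest address as the least significant byte. `bytewise_less` mimics the 8-bit listing, which compares the byte at the lowest address first and so orders the values as if they were big-endian.

```python
def load24(mem, addr):
    """Model of lda.t: three bytes, lowest address least significant."""
    return int.from_bytes(mem[addr:addr + 3], "little")

def store24(mem, addr, value):
    """Model of sta.t."""
    mem[addr:addr + 3] = value.to_bytes(3, "little")

def bubble_sort24(mem, base, count):
    """Sort `count` 3-byte values in place, repeating passes until no swap occurs."""
    while True:
        swapped = False                      # plays the role of the $FF flag
        for x in range(0, (count - 1) * 3, 3):
            a = load24(mem, base + x)        # lda.t base,X
            b = load24(mem, base + x + 3)    # cmp.t base+3,X
            if a > b:                        # neither beq nor bcc taken: swap
                store24(mem, base + x, b)
                store24(mem, base + x + 3, a)
                swapped = True
        if not swapped:
            return

def bytewise_less(mem, a_addr, b_addr):
    """Compare from the lowest address upward, as the 8-bit listing does."""
    return bytes(mem[a_addr:a_addr + 3]) < bytes(mem[b_addr:b_addr + 3])
```

For the pair $000100 and $000002 stored little-endian, `load24` orders them numerically while `bytewise_less` puts $000100 first, which is the endianness mismatch mentioned above.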
Re: The 65M202 road map that could have been
After coding a few thousand lines of assembly for an emulated 652402, I've found one oddity that could use some advice.
LDA.w ($zp),Y does what you would expect, loading a 16-bit value from the indirect indexed address. There is no question that having 16-bit and 24-bit values makes coding real-world code a whole lot easier.
The problem is, sometimes you really just want one byte, such as when you are processing ASCII or copying a buffer that can be an odd number of bytes. And LDA ($zp),Y is ambiguous when it comes to the size of the address hiding behind $zp and the size of the Y register that is added to that address.
What I realized in debugging a buffer copying loop is that the size of X and Y in all the indexed address modes must be defined by the width of the addresses in those instructions, not by the specified width of the registers. So it's LDX.w #$1000 to load 16 bits into X, but then it's LDA.a24 $F0000,X to load one byte from $F1000 and LDA.a24.t $F0000,X to load three bytes from that address, in both cases the .a24 suffix ensuring that the absolute address is 24 bits, as is the X register.
That works fine when the address in the instruction is absolute or indirect, but the $zp modes are 8-bit addresses, not 16-bit or 24-bit. I've not found a way to make this work in a way that feels elegant. And the more code I write, the more I find the ($zp),Y address mode useful and irreplaceable.
Do you see a better solution?
Meanwhile... what became more and more obvious as I get used to this odd CPU is that I'm often moving addresses into zp for use in that ($zp),Y address mode. Most often these are addresses stored in tables. Given I can load 24-bits into A, X, or Y in a single mnemonic, it feels like an extra step to then save to zp rather than just use X and Y as a base address and index directly.
Today I added three new addressing modes: LDA XY, LDA (XY), and LDA (X),Y, the last specifically to replace ($zp),Y. XY just adds X+Y to create the target address. (XY) adds X+Y and indirectly loads the address from that address. (X),Y loads an address from X, then adds Y, then loads from that aggregate address. All three can use 16-bit or 24-bit addresses, and like before, the width of X and Y is driven by that choice, separately from whether the value being loaded is 8-bit, 16-bit, or 24-bit.
The only code I've written so far for these are tests to make sure the emulator is working correctly. Next up is refactoring all my code to replace zp with (X),Y. I'm curious to see if the other two modes will get much use, but with an emulator in Go, it just takes a few minutes to add in an instruction to play with.
Thoughts? Does this stray too far from the 65xx mojo or does this feel like a natural extension to a never-built 652402 in my alternate reality where the 68000 didn't disrupt the 8-bit world?
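For anyone trying to picture the three new modes, here is a hedged Python sketch of the effective-address calculation as described in this post. It is not taken from the iz6502 source. The function names are invented, and masking the sum to the address width (so it wraps) is an assumption.

```python
def read(mem, addr, width):
    """Little-endian read of `width` bytes, one byte at a time."""
    data = bytes(mem[(addr + i) % len(mem)] for i in range(width))
    return int.from_bytes(data, "little")

def ea_xy(x, y, addr_bits):
    """LDA XY: the target address is simply X + Y."""
    return (x + y) & ((1 << addr_bits) - 1)

def ea_xy_indirect(mem, x, y, addr_bits):
    """LDA (XY): X + Y points at the address to load from."""
    return read(mem, ea_xy(x, y, addr_bits), addr_bits // 8)

def ea_x_indirect_y(mem, x, y, addr_bits):
    """LDA (X),Y: X points at a base address, then Y is added to it."""
    mask = (1 << addr_bits) - 1
    base = read(mem, x & mask, addr_bits // 8)
    return (base + y) & mask

def lda(mem, ea, width):
    """The data width (1, 2, or 3 bytes) is chosen separately from the address width."""
    return read(mem, ea, width)
```

In this model `ea_x_indirect_y` does the job of ($zp),Y with the pointer's location held in X instead of in zero page, and the address width and the data width stay independent parameters.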
Re: The 65M202 road map that could have been
It feels reasonable to me - as a matter of ISA design - once a register can hold an address, to be able to use it as an address.
In fact once you've done that, you might find you have less need of indirection through memory.
Whether this is 6502ish is another question!
Re: The 65M202 road map that could have been
65LUN02 wrote:
What I realized in debugging a buffer copying loop is that the size of X and Y in all the indexed address modes must be defined by the width of the addresses in those instructions, not by the specified width of the registers.
On the 65020, I have the index registers always adding 32 bits regardless of everything else. And I very often turn the 6502 indexed addressing upside-down: instead of adding a small variable offset (the index register) to a fixed base address, it's adding a small constant offset to a variable address. Routines that access data in memory don't need it to be in a single fixed location, but can take a pointer to it in a register.
This re-interpretation of indexed addressing has meant that I have not needed to use the indirect modes at all. If I did, (zp, X) would be the useful one.
My CPU is rather different from yours. I have a lot more registers, for a start. That means it's possible for code to keep data in registers, which means I'm not using memory as a scratchpad. This approach might not be as practical on yours.
Quote:
Does this stray too far from the 65xx mojo
Re: The 65M202 road map that could have been
John West wrote:
65LUN02 wrote:
On the 65020, I have the index registers always adding 32 bits regardless of everything else. And I very often turn the 6502 indexed addressing upside-down: instead of adding a small variable offset (the index register) to a fixed base address, it's adding a small constant offset to a variable address. Routines that access data in memory don't need it to be in a single fixed location, but can take a pointer to it in a register.
My immediate reaction is to modify the compiler to make that upside-down format right-side up, adding a syntax like LDA Y,+vvvvvv using the + on purpose to make it obvious that it's an offset, not an address.
As I refactor my code to swap out the LDA ($zp),Y's, I'll try a few your way and try a few with LDA (X),Y and see if having the offset makes the loops easier to write and later, read.
Thank you.
Re: The 65M202 road map that could have been
Turns out all that is needed is LDA XY.
LDA XY is the same as LDA (zp),Y but without the need to STA into zp.
Any further need of indirection can be handled by LDA XY, TAX, and another LDA XY.
But mostly the pattern I need is LDX from, LDA X followed by LDX to, STA X to move a sequence of bytes from one place to another, sometimes with the same Y index and sometimes with LDY grabbing the matching index for each sequence.
And while LDA $0,X is equivalent to a LDA X mnemonic, avoiding having to LDY #0, I had enough cases where I would have used a LDA X instruction that I traded off the one opcode and added that to my CPU emulator rather than having the assembler turn it into LDA $0,X. That makes the code one byte smaller per use and avoids having to leave #$00 in zp $00.
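The copy pattern and the chained indirection described above might look like this in a rough Python model. It only illustrates the register traffic: `copy_bytes` and `load_double_indirect` are hypothetical names, the copy uses the shared-Y variant, and a 16-bit pointer width is assumed for the chained load.

```python
def copy_bytes(mem, src, dst, count):
    """Move `count` bytes with a shared Y index, with X switching between the two bases."""
    y = 0                       # LDY #0
    while y < count:
        x = src                 # LDX from
        a = mem[x + y]          # LDA XY
        x = dst                 # LDX to
        mem[x + y] = a          # STA XY
        y += 1                  # INY

def load_double_indirect(mem, x, y):
    """LDA XY, TAX, LDA XY: fetch a pointer at X + Y, then load through it."""
    a = int.from_bytes(mem[x + y:x + y + 2], "little")  # LDA.w XY (assumed 16-bit pointer)
    x = a                                                # TAX
    return mem[x + y]                                    # LDA XY, with Y unchanged
```

Nothing here touches zero page: the base address lives in X for the whole sequence, which is the point of the new mode.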