
All times are UTC




PostPosted: Mon Jul 25, 2022 8:15 pm 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Sean wrote:
Where do you get 10K transistors for the ARM1?


Sorry, probably bad memory, or my memory was confusing gates and transistors from a page other than https://en.wikipedia.org/wiki/Transistor_count. CMOS gates use roughly double the transistors of NMOS or PMOS gates. Wikipedia says 25,000 *transistors* for the ARM 1, and 11,500 for the WDC 65C02. The NMOS MOS 6502 was 4,528 transistors.

Thus the 65C24T8 would be roughly 15,000 transistors while an NMOS 652402 back in the late 1970s would be only 6,750 or so transistors.


PostPosted: Tue Aug 02, 2022 11:04 pm 

Joined: Mon Feb 15, 2021 2:11 am
Posts: 100
Sean wrote:
I think I read somewhere that there were later enhancements to allow separate 64K address space for data, 64K address space for program, per user, but I may be mistaken.


Digging a bit online, I found that the PDP-11/45 introduced split instruction/data spaces to the PDP-11 series, according to https://gunkies.org/wiki/PDP-11/45 and mentioned tangentially by https://www.computercollection.net/pdp11/index.htm.


PostPosted: Sun Aug 21, 2022 6:45 pm 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
BigEd wrote:
What I've found in previous adventures is that an emulator and an assembler are quite handy at this stage. Failing the emulator, if one uses hardware, then a monitor.


I managed to get dasm https://github.com/lunarmobiscuit/dasm to handle addresses above $FFFF, and to output the A24 prefix code when referencing addresses above $FFFF. But adding variable register widths to that assembler didn't look easy.

I'm much more of a C coder than an assembler coder, and while playing with https://github.com/lunarmobiscuit/izapple2/ and its 6502 emulator https://github.com/lunarmobiscuit/iz6502/ to create an emulated "Apple ][4", it seemed easier to write an assembler/compiler that mixes C-like blocks, ifs, loops, and variables with assembly mnemonics than to stick solely with assembly or jump all the way to C.

The result of that is https://github.com/lunarmobiscuit/pomme/. So far it's just an assembler, but with {} blocks for subroutines and data. https://github.com/lunarmobiscuit/pomme/blob/main/compile/test.pom will offend both assembly coders and C coders in its style, but I like where it is headed.

Meanwhile, in terms of the 65C24T8, it is a working assembler, including every address mode, address size, register size, and every opcode I've described prior in this thread along with a few that seemed likely to be needed once I start playing with 16-bit and 24-bit registers: shift left/right 4, shift left/right 8, A+=x, A+=y, A=X+Y.

The number of mnemonics doesn't rise by much, and the opcode space still has quite a few holes, but the total number of possible instructions jumps past 600. See https://github.com/lunarmobiscuit/pomme/blob/main/compile/opcodes.pom for every possibility and https://github.com/lunarmobiscuit/pomme/blob/main/compile/opcodes.lst for the generated machine code.

The iz6502 emulator currently implements all of those up to SWS (set stack width). I'll add the others soon.


PostPosted: Sun Aug 21, 2022 7:34 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
(Hmm, is that perhaps a private repo?)


PostPosted: Sun Aug 21, 2022 11:32 pm 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
BigEd wrote:
(Hmm, is that perhaps a private repo?)

It was, but that is now fixed. And the bug fixes through the three levels of assembler, CPU emulator, Apple2 emulator are committed too.

Having bugs possible in all three layers makes this an interesting project to debug. That was the driving factor toward writing pomme in Go, as iz6502 and izapple2 were written in Go.

Meanwhile... the functionality of Apple2four.pom now matches the functionality of Apple2four.asm. The only difference so far is that the .pom version uses a few 16-bit and 24-bit registers for accessing the screen. The first few learnings:

- Data tables of addresses are a lot simpler. What was:
Code:
TextScreenBaseL
   DC.B $00, $80, $00, $80, $00, $80, $00, $80
   DC.B $28, $A8, $28, $A8, $28, $A8, $28, $A8
   DC.B $50, $D0, $50, $D0, $50, $D0, $50, $D0
TextScreenBaseH
   DC.B $04, $04, $05, $05, $06, $06, $07, $07
   DC.B $04, $04, $05, $05, $06, $06, $07, $07
   DC.B $04, $04, $05, $05, $06, $06, $07, $07

becomes:
Code:
data TextScreenBase @$FF8000 word {
   $0400, $0480, $0500, $0580, $0600, $0680, $0700, $0780
   $0428, $04A8, $0528, $05A8, $0628, $06A8, $0728, $07A8
   $0450, $04D0, $0550, $05D0, $0650, $06D0, $0750, $07D0
}


- Pomme doesn't need the < and > operators. What was:
Code:
  lda #<CMD_Clear            ; $05/$06/$07 = COMMAND string base address
  sta $05
  lda #>CMD_Clear
  sta $06
  lda #$FF                   ; hard-coded to $FF as dasm doesn't have >> to grab the 3rd byte
  sta $07

becomes:
Code:
  lda.t #CMD_Clear      ; $05/$06/$07 = COMMAND string base address
  sta.t $05
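
As a rough cross-check of the two list items above, here is a small Python sketch (not part of pomme or dasm) that merges the original low/high byte tables into the word table and splits a 24-bit address into the three bytes that < and > (plus a third-byte operator) would have produced. The helper names and the sample address $FF8123 are hypothetical; the row formula is the standard Apple II text page 1 layout.

```python
# Low/high byte tables copied from the dasm version above.
TEXT_SCREEN_BASE_L = [
    0x00, 0x80, 0x00, 0x80, 0x00, 0x80, 0x00, 0x80,
    0x28, 0xA8, 0x28, 0xA8, 0x28, 0xA8, 0x28, 0xA8,
    0x50, 0xD0, 0x50, 0xD0, 0x50, 0xD0, 0x50, 0xD0,
]
TEXT_SCREEN_BASE_H = [0x04, 0x04, 0x05, 0x05, 0x06, 0x06, 0x07, 0x07] * 3


def merge_tables(lo, hi):
    """Combine parallel low/high byte tables into one table of 16-bit words."""
    return [(h << 8) | l for l, h in zip(lo, hi)]


def text_row_base(row):
    """Base address of a text row on Apple II text page 1 (rows 0..23)."""
    return 0x0400 + 0x80 * (row % 8) + 0x28 * (row // 8)


def split_address(addr):
    """Return (low, high, bank) bytes of a 24-bit address, little-endian order."""
    return (addr & 0xFF, (addr >> 8) & 0xFF, (addr >> 16) & 0xFF)


TEXT_SCREEN_BASE = merge_tables(TEXT_SCREEN_BASE_L, TEXT_SCREEN_BASE_H)

# Hypothetical address standing in for CMD_Clear; the real value is not given.
lo, hi, bank = split_address(0xFF8123)
```

The merged list matches the word table in the .pom version row for row, which is why a single lda.w or lda.t can replace the byte-at-a-time loads.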


In these 480 lines of 6502 assembly, the original compiles into 1,355 bytes of code. All the /**/s and {}s grow the assembly to 490 lines, but it compiles to only 1,318 bytes of code. So despite the prefix codes, the handful of lines of code loading and storing 16-bit and 24-bit addresses saved 37 bytes.

And the two tiny optimizations I made clear the screen two bytes at a time and scroll the screen two bytes at a time. Both of those make the code longer, as I have to iny twice per loop, but in terms of cycles, those loops only have to run half as many times: 20 times per row instead of 40.

Finally, the only bug introduced in these changes was in pulling the address from that TextScreenBase data table. In the original, X held the row number, and TextScreenBase,X pulled the correct address from the byte-sized table. Now X needs to be twice the row number. So just as writing two bytes at a time in ClearScreen requires the Y index to be incremented twice (iny, iny), the X index in the ClearScreen loop needs inx twice too, with the cpx #24 changed to cpx #48 to match.

Which then makes me wonder... given a 16-bit 6502, does it then make sense to have a prefix code to make ,X and ,Y double the value in X or Y prior to using it as an index? That is simple to implement in hardware, as it's just the register shifted left one bit with a zero in the lsb. But then similarly, for 24-bit operations, a prefix code to pre-compute X*3 and Y*3 would be just as helpful, and computing X*3 = (X<<1)+X is an extra cycle. That seems a slippery slope to head down, so perhaps just xsl, ysl, as more orthogonal to asl, lsr and more generally useful.
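
To make the scaling question concrete, here is a small Python sketch of the two index scalings and of the bug described above (using the row number directly as a byte index into a word table). The function names are hypothetical and the 8-bit register width is an assumption for illustration; this is not how the emulator is written.

```python
def scale2(x, bits=8):
    """Index * 2: the register shifted left one bit, truncated to its width."""
    return (x << 1) & ((1 << bits) - 1)


def scale3(x, bits=8):
    """Index * 3 as a shift plus an add, truncated to the register width."""
    return ((x << 1) + x) & ((1 << bits) - 1)


def pack_words(words):
    """Lay out 16-bit words as little-endian bytes, as a word table sits in memory."""
    out = bytearray()
    for w in words:
        out.append(w & 0xFF)
        out.append(w >> 8)
    return out


def read_word(table, row):
    """Correct lookup: the byte index is twice the row number."""
    i = scale2(row)
    return table[i] | (table[i + 1] << 8)


def read_word_unscaled(table, row):
    """The bug: using the row number itself as the byte index."""
    return table[row] | (table[row + 1] << 8)


table = pack_words([0x0400, 0x0480, 0x0500])
```

With 24 rows the scaled index runs from 0 to 46, which is where the cpx #48 loop bound comes from.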


PostPosted: Mon Aug 22, 2022 3:32 am 

Joined: Mon Feb 15, 2021 2:11 am
Posts: 100
65LUN02 wrote:
Which then makes me wonder... given a 16-bit 6502, does it then make sense to have a prefix code to make ,X and ,Y double the value in X or Y prior to using it as an index? That is simple to implement in hardware, as it's just bits 6:1 of the register with a zero in the lsb. But then similarly, for 24-bit operations, a prefix code to pre-compute X*3 and Y*3 would be just as helpful, and computing X*3=X<<1+X is an extra cycle. That seems a slippery slope to head down, so perhaps just xsl, ysl as more orthogonal to asl, lsr and more generally useful


That's an interesting question. You would of course want X and Y to remain usable for non-indexed purposes, such as a loop control variable. With respect to indexed use: Do you do byte reads from 16-bit memory or just word reads, which might require the LSB to distinguish between first and second byte in the word? Or is that perhaps tangential to the question at hand, since the LSB could be ignored in address generation and only used for routing one byte or the other to the register?


PostPosted: Mon Aug 22, 2022 4:18 am 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Sean wrote:
With respect to indexed use: Do you do byte reads from 16-bit memory or just word reads, which might require the LSB to distinguish between first and second byte in the word? Is that perhaps tangential to the question at hand, since the LSB could be ignored in address generation and only used for routing one byte or the other to register.


I'm still using the premise from the original post in this thread, that this 652402/65C24T8 design is something that could have been created in the late 1970s. I've thus stuck with an 8-bit data bus to keep the number of pins <= 48 and to make it possible to drop this chip into a next iteration Apple ][, Commodore, Atari, BBC Micro, etc.

The registers are 1, 2, or 3 bytes wide, but the state machine ends up iterating over extra cycles loading the extra bytes.

Thus lda.w $00 issues two memory reads: $00 and $01. A ends up with the little-endian value [$01][$00].
Similarly lda.t $1234 issues three memory reads, with the value [$1236][$1235][$1234] stored in register A.

With an 8-bit bus, there is no concept of word alignment as seen on the 68000 or RISC chips. You can store to $1234 and then load 1, 2, or 3 bytes from $1233 or $1235 if you want.
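
A minimal Python sketch of that behaviour, assuming a flat byte-addressed memory and ignoring cycle counts and address wrap-around. The load and store helpers are hypothetical names, not functions from iz6502.

```python
def load(mem, addr, width):
    """Read `width` bytes (1, 2, or 3) one at a time and assemble them little-endian."""
    value = 0
    for i in range(width):
        value |= mem[addr + i] << (8 * i)
    return value


def store(mem, addr, value, width):
    """Write `width` bytes, least significant byte first."""
    for i in range(width):
        mem[addr + i] = (value >> (8 * i)) & 0xFF


mem = bytearray(0x10000)
store(mem, 0x1234, 0xABCDEF, 3)       # like sta.t $1234
whole = load(mem, 0x1234, 3)          # like lda.t $1234
upper = load(mem, 0x1235, 2)          # like lda.w $1235, an odd address
straddle = load(mem, 0x1233, 3)       # like lda.t $1233, one byte early
```

Because every access is a sequence of single-byte reads, the odd and overlapping addresses need no special handling.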

I can see that being a problem if some later version of this chip moved to a wider 16-bit or 24-bit data bus, but by the 1980s CPU clock speeds were faster than memory speeds, and thus by then one would expect a cache between the CPU and memory. The cache could deal with unaligned requests, as it would be loading more than 2 or 3 bytes per load.


PostPosted: Mon Aug 22, 2022 6:47 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Thanks for sharing! Interesting ideas indeed.


PostPosted: Sat Aug 27, 2022 7:42 pm 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Over on http://forum.6502.org/viewtopic.php?f=2&t=7304 we're talking about whether zero page would be treated more like addressable registers if the assembly mnemonics treated them like registers instead of 8-bit addressable memory. In testing out those ideas, I whipped up a bubble sort that sorts 24-bit values. Which, of course, makes me want to see for this thread what happens if that algorithm is coded up with 24-bit registers.

The traditional (non-optimized) 6502 version is:
Code:
  ldy #0
loop_Y:
    lda #00         ; Is the data sorted (variable $FF = #0 if sorted and #1 if not)
    sta $FF
    ldx #0          ; Loop forward so that items move from left to right
loop_X:
    lda $400,X
    cmp $403,X      ; N[X][0] ?= N[X+1][0]
    beq  +byte2     ; IF (X+1 == X) THEN check next byte
    bcs  +swap      ; IF (X+1 < X) THEN swap
    clv
    bvc +next
byte2:
    lda $401,X
    cmp $404,X
    beq  +byte3     ; IF (X+1 == X) THEN check the next byte
    bcs  +swap      ; IF (X+1 < X) THEN swap
    clv
    bvc +next
byte3:
    lda $402,X
    cmp $405,X
    beq  +next      ; IF (X+1 == X) THEN no swap
    bcc  +next      ; IF (X+1 < X) THEN next loop
swap:
    lda $400,X      ; $400/1/2,X -> $00/1/2
    sta $00
    lda $401,X
    sta $01
    lda $402,X
    sta $02
    lda $403,X      ; $403/4/5,X -> $400/1/2,X
    sta $400,X
    lda $404,X
    sta $401,X
    lda $405,X
    sta $402,X
    lda $00         ; $00/1/2 -> $403/4/5,X
    sta $403,X
    lda $01
    sta $404,X
    lda $02
    sta $405,X
    inc 255         ; No, the data isn't sorted
next:
    inx
    inx
    inx
    cpx #27         ; Loop N-1 items (N=10, each item is 3 bytes), so 9x3 = 27
    bne -loop_X
    lda $FF
    bne -loop_Y
    rts

53 lines of code (including the six labels), 108 bytes of machine code

To create the 24-bit register version, I just took the code, added a few .t's, and took out all the parts that iterate over the three individual bytes. In debugging I did notice that the traditional version is comparing big-endian values instead of little-endian values, but I didn't fix that bug.
Code:
00d800 a0 00              ldy #$0
00d802                  loop_y:
00d802 a9 00              lda #$0
00d804 85 ff              sta $ff
00d806 a2 00              ldx #$0
00d808                  loop_x:
00d808 20 30 f8           jsr $f830
00d80b 2f bd 00 04        lda.t $000400,X
00d80f 2f dd 03 04        cmp.t $000403,X
00d813 f0 1a              beq +26
00d815 90 18              bcc +24
00d817                  swap:
00d817 2f bd 00 04        lda.t $000400,X
00d81b 2f 85 00           sta.t $00
00d81e 2f bd 03 04        lda.t $000403,X
00d822 2f 9d 00 04        sta.t $000400,X
00d826 2f a5 00           lda.t $00
00d829 2f 9d 03 04        sta.t $000403,X
00d82d e6 ff              inc $ff
00d82f                  next:
00d82f e8                 inx
00d830 e8                 inx
00d831 e8                 inx
00d832 e0 1b              cpx #$1b
00d834 d0 d2              bne -46
00d836                  test:
00d836 c8                 iny
00d837 a5 ff              lda $ff
00d839 d0 c7              bne -57
00d83b                  done:
00d83b 60                 rts

Only 28 lines of code and only 60 bytes of machine code.
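
For readers who want to check the logic rather than the opcodes, here is a rough Python model of the 24-bit register version: each compare and each swap handles a whole 3-byte little-endian value at once, and a flag plays the role of $FF. The helper names are hypothetical, the memory is a plain bytearray, and the sample values are made up.

```python
ITEM = 3  # bytes per value


def load24(mem, addr):
    """Model of lda.t: three byte reads assembled little-endian."""
    return mem[addr] | (mem[addr + 1] << 8) | (mem[addr + 2] << 16)


def store24(mem, addr, value):
    """Model of sta.t: three byte writes, least significant byte first."""
    mem[addr] = value & 0xFF
    mem[addr + 1] = (value >> 8) & 0xFF
    mem[addr + 2] = (value >> 16) & 0xFF


def bubble_sort24(mem, base, count):
    """Sort `count` 24-bit values in place, starting at address `base`."""
    limit = (count - 1) * ITEM          # 27 for 10 items, as in cpx #27
    while True:
        swapped = False                 # the $FF flag
        x = 0
        while x < limit:
            a = load24(mem, base + x)
            b = load24(mem, base + x + ITEM)
            if a > b:                   # neither beq nor bcc taken
                store24(mem, base + x, b)
                store24(mem, base + x + ITEM, a)
                swapped = True
            x += ITEM                   # inx, inx, inx
        if not swapped:
            return


values = [0x000100, 0x0000FF, 0xFFFFFF, 0x000000, 0x123456,
          0x010000, 0x00FF00, 0x654321, 0x000001, 0x800000]
mem = bytearray(0x1000)
for n, v in enumerate(values):
    store24(mem, 0x400 + n * ITEM, v)
bubble_sort24(mem, 0x400, len(values))
result = [load24(mem, 0x400 + n * ITEM) for n in range(len(values))]
```

Values such as $000100 and $0000FF are in the sample on purpose: comparing the whole little-endian value orders them correctly, whereas a comparison that starts at the lowest-addressed byte would not.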

The choice of 24-bit values certainly helped in that comparison. With 16-bit values the traditional version would have been smaller, while the wider-register version would stay at the same 60 bytes. Values of 32 bits or wider would require comparing in chunks on both CPUs, but in 16-bit or 24-bit chunks on the 652402 rather than one byte at a time.

I chose bubble sort as it's a whole page of code in length, and as the memory swap can't be done with just A, X, and Y. That swap requires either using zp or the stack. It thus shows off how much more efficient a 6502 can be if such operations can be done 2 or 3 bytes at a time, even if underneath the machine code the CPU is actually just iterating over bytes on an 8-bit data bus.

My assembler doesn't count total cycles, but the wider register version is certainly going to be fewer cycles, as if nothing else it eliminates all the branches in comparing one byte at a time vs. three bytes in one shot.

I'd love to see this same unoptimized algorithm in 65C816 assembly too for comparison. Can someone who knows it post it?


PostPosted: Tue Sep 06, 2022 8:55 pm 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
After coding a few thousand lines of assembly for an emulated 652402, I've found one oddity that could use some advice.

LDA.w $(zp),Y does what you would expect, loading a 16-bit value from the indirect indexed address. There is no question that having 16-bit and 24-bit values makes writing real-world code a whole lot easier.

The problem is, sometimes you really just want one byte. Like when you are processing ASCII or when you are copying a buffer that can be an odd number of bytes. The problem is LDA $(zp),Y is ambiguous when it comes to the size of the address hiding behind $zp and the size of the Y register that is added to that address.

What I realized in debugging a buffer-copying loop is that the size of X and Y in all the indexed address modes must be defined by the width of the addresses in those instructions, not by the specified width of the registers. So it's LDX.w #$1000 to load 16 bits into X, but then it's LDA.a24 $F0000,X to load one byte from $F1000 and LDA.a24.t $F0000,X to load three bytes from that address. In both cases the .a24 suffix ensures that the absolute address is 24 bits wide, as is the X register.
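
One way to picture that rule is the sketch below, in which the index register is masked to the address width before it is added. The masking and the wrap-around at the address width are assumptions made for illustration, and indexed_address is a hypothetical helper, not emulator code.

```python
def indexed_address(base, index_reg, addr_bits):
    """Effective address for base,X when the index width follows the address width.

    Assumption: the index is truncated to `addr_bits` and the sum wraps
    at the same width.
    """
    mask = (1 << addr_bits) - 1
    return (base + (index_reg & mask)) & mask


# LDA.a24 $F0000,X with X = $1000: a 24-bit address, so X is used at 24 bits.
ea24 = indexed_address(0xF0000, 0x1000, 24)

# The same register contents used with a 16-bit absolute address:
# only the low 16 bits of X take part.
ea16 = indexed_address(0x2000, 0x011000, 16)
```

The operand width (.w or .t) does not appear anywhere in this calculation, which is the point of the paragraph above: it only decides how many bytes are read once the address is known.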

That works fine when the address in the instruction is absolute or indirect, but the $zp modes are 8-bit addresses, not 16-bit or 24-bit. I've not found a way to make this work in a way that feels elegant. And the more code I write, the more I find the ($zp),Y address mode useful and irreplaceable.

Do you see a better solution?

Meanwhile... what became more and more obvious as I get used to this odd CPU is that I'm often moving addresses into zp for use in that ($zp),Y address mode. Most often these are addresses stored in tables. Given I can load 24-bits into A, X, or Y in a single mnemonic, it feels like an extra step to then save to zp rather than just use X and Y as a base address and index directly.

Today I added three new addressing modes: LDA XY, LDA (XY), and LDA (X),Y, the latter specifically to replace ($zp),Y. XY just adds X+Y to create the target address. (XY) adds X+Y and indirectly loads the address from that address. (X),Y loads an address from X, then adds Y, then loads from that aggregate address. All three can be 16-bit or 24-bit addresses, and like before, the width of X and Y is driven by that choice, separately from whether the value being loaded is 8-bit, 16-bit, or 24-bit.
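
Here is a small Python sketch of the effective-address calculation for the three modes as described above, assuming 24-bit addresses and little-endian pointers in memory. Memory is modelled as a dict that reads as zero where nothing was written, and all function names are hypothetical.

```python
MASK24 = 0xFFFFFF


def read_ptr(mem, addr):
    """Read a 24-bit little-endian pointer stored at `addr`."""
    return (mem.get(addr, 0)
            | (mem.get((addr + 1) & MASK24, 0) << 8)
            | (mem.get((addr + 2) & MASK24, 0) << 16))


def ea_xy(x, y):
    """LDA XY: the target address is simply X + Y."""
    return (x + y) & MASK24


def ea_xy_indirect(mem, x, y):
    """LDA (XY): fetch the target address from memory at X + Y."""
    return read_ptr(mem, ea_xy(x, y))


def ea_x_indirect_y(mem, x, y):
    """LDA (X),Y: fetch an address from memory at X, then add Y."""
    return (read_ptr(mem, x) + y) & MASK24


# A pointer to $012000 stored at $001000.
mem = {0x001000: 0x00, 0x001001: 0x20, 0x001002: 0x01}
```

In this model (X),Y behaves like ($zp),Y with the pointer's location held in X instead of in zero page.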

The only code I've written so far for these are tests to make sure the emulator is working correctly. Next up is refactoring all my code to replace zp with (X),Y. I'm curious to see if the other two modes will get much use, but with an emulator in Go, it just takes a few minutes to add in an instruction to play with.

Thoughts? Does this stray too far from the 65xx mojo or does this feel like a natural extension to a never-built 652402 in my alternate reality where the 68000 didn't disrupt the 8-bit world?


PostPosted: Tue Sep 06, 2022 9:16 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
It feels reasonable to me - as a matter of ISA design - once a register can hold an address, to be able to use it as an address.

In fact once you've done that, you might find you have less need of indirection through memory.

Whether this is 6502ish is another question!


PostPosted: Tue Sep 06, 2022 9:33 pm 

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 336
65LUN02 wrote:
What I realized in debugging a buffer copying loop is that the size of X and Y in all the indexed address modes must be defined by the width of the addresses in those instructions, not by the specified width of the registers.


I haven't been following this thread in detail, and I am not as familiar with your CPU as you are. But I would question this.

On the 65020, I have the index registers always adding 32 bits regardless of everything else. And I very often turn the 6502 indexed addressing upside-down: instead of adding a small variable offset (the index register) to a fixed base address, it's adding a small constant offset to a variable address. Routines that access data in memory don't need it to be in a single fixed location, but can take a pointer to it in a register.
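
The "upside-down" reading can be shown with one address calculation used two ways. The sketch below is only an illustration with a hypothetical record layout and a 16-bit address space for brevity; it is not 65020 code.

```python
def lda_abs_x(mem, operand, x):
    """Model of LDA operand,X: one byte from operand + X (16-bit wrap for brevity)."""
    return mem[(operand + x) & 0xFFFF]


mem = bytearray(0x10000)

# Classic use: fixed table base in the instruction, small index in X.
TABLE = 0x3000
mem[TABLE + 2] = 0x99
from_table = lda_abs_x(mem, TABLE, 2)

# Upside-down use: pointer to a record in X, small constant field offset
# in the instruction. FIELD_FLAGS is a hypothetical field at offset 2.
FIELD_FLAGS = 2
record = 0x2000
mem[record + FIELD_FLAGS] = 0x42
from_record = lda_abs_x(mem, FIELD_FLAGS, record)
```

The second form only works when the index register is as wide as an address, which is why it is unavailable on a stock 6502 with 8-bit X and Y.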

This re-interpretation of indexed addressing has meant that I have not needed to use the indirect modes at all. If I did, (zp, X) would be the useful one.

My CPU is rather different from yours. I have a lot more registers, for a start. That means it's possible for code to keep data in registers, which means I'm not using memory as a scratchpad. This approach might not be as practical on yours.

Quote:
Does this stray too far from the 65xx mojo


My solution does. The 65020 doesn't feel anything like a 6502. It's a quirky early RISCish processor (if you ignore the half of the instruction set that I never use) that happens to be able to run legacy code.


PostPosted: Wed Sep 07, 2022 12:13 am 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
John West wrote:
On the 65020, I have the index registers always adding 32 bits regardless of everything else. And I very often turn the 6502 indexed addressing upside-down: instead of adding a small variable offset (the index register) to a fixed base address, it's adding a small constant offset to a variable address. Routines that access data in memory don't need it to be in a single fixed location, but can take a pointer to it in a register.


Oh, I see. So LDA $aaaa,Y becomes the equivalent of LDA ($zp),Y with $aaaa as a pseudo-immediate offset. That is one of those assembly language patterns that is totally obvious once it is pointed out.

My immediate reaction is to modify the compiler to make that upside-down format right-side up, adding a syntax like LDA Y,+vvvvvv using the + on purpose to make it obvious that it's an offset, not an address.

As I refactor my code to swap out the LDA ($zp),Y's, I'll try a few your way and try a few with LDA (X),Y and see if having the offset makes the loops easier to write and later, read.

Thank you.


PostPosted: Sat Sep 24, 2022 7:22 pm 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Turns out all that is needed is LDA XY.
LDA XY is the same as LDA (zp),Y but without the need to STA into zp.

Any further need of indirection can be done by LDA XY, TAX, and another LDA XY.
But mostly the pattern I need is LDX from, LDA X followed by LDX to, STA X to move a sequence of bytes from one place to another, sometimes with the same Y index and sometimes with LDY grabbing the matching index for each sequence.
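
A rough Python rendering of that copy pattern, for the case where both sequences share the same Y index. The variable names mirror the registers, the addresses are made up, and the function is a hypothetical illustration rather than code from the project.

```python
def copy_bytes(mem, src, dst, count):
    """Copy `count` bytes using a base in X and a shared index in Y."""
    for y in range(count):
        x = src              # LDX from
        a = mem[x + y]       # LDA XY
        x = dst              # LDX to
        mem[x + y] = a       # STA XY


mem = bytearray(0x100)
mem[0x10:0x15] = b"HELLO"    # an odd number of bytes, copied one at a time
copy_bytes(mem, 0x10, 0x40, 5)
```

No zero-page pointer is set up anywhere in the loop, which is the saving over the ($zp),Y version.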

And while LDA $0,X is equivalent to a LDA X mnemonic, avoiding having to LDY #0, I had enough cases where I would have used a LDA X instruction that I traded off the one opcode and added that to my CPU emulator rather than having the assembler turn it into LDA $0,X. That makes the code one byte smaller per use and avoids having to leave #$00 in zp $00.

