6502.org Forum

All times are UTC




 Post subject: Re: Extended 65CE02 Core
PostPosted: Sat Jun 26, 2021 2:29 am 
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3353
Location: Ontario, Canada
Thanks for the kind words, Michael. It's fun to help others tinker with their designs. Indeed, I seem to recall I also incited you to overload the prefixes recognized by your core. And, speaking of your core, congratulations on the progress you recently reported in the thread Pascal Compiler and Assembler for M65C02A processor core.

Quote:
My core's mov instruction has a similar single cycle transfer mode which limits the interrupt latency to 3-5 clock cycles.
Right. In fact, you steered us toward this idea earlier in this thread when you suggested breaking down "long instructions as 'partial' implementations of the algorithms." Another example comes to mind, one which pertains to the Am29000 RISC family which I used to drool over back in the day. :) Instead of a Multiply instruction, what it offers is a Multiply Step instruction, which produces one bit of result. So, you need a rather repetitive string of identical Multiply Step instructions to get the full result. But!... low interrupt latency and RISC-y simplicity are preserved. :!:
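
As a rough illustration of the Multiply Step idea, here is a minimal Python sketch of a shift-and-add multiply split into identical one-bit steps. The function names and the 8-bit operand width are assumptions made for illustration; this is not the Am29000's actual instruction semantics.

```python
def multiply_step(acc, multiplier, multiplicand):
    """One step: consume the lowest multiplier bit, then shift both operands."""
    if multiplier & 1:
        acc += multiplicand
    return acc, multiplier >> 1, multiplicand << 1


def multiply(a, b, bits=8):
    """Multiply two unsigned `bits`-wide values using `bits` identical steps."""
    acc, multiplier, multiplicand = 0, b, a
    for _ in range(bits):
        # An interrupt could be serviced between any two steps,
        # because each step is a complete, short instruction.
        acc, multiplier, multiplicand = multiply_step(acc, multiplier, multiplicand)
    return acc
```

The point of the sketch is that all state lives in ordinary registers between steps, so nothing special has to be saved when an interrupt arrives.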

I had some further thoughts about the block-move for Proxy, and was making edits when your post appeared. What strikes me is this. In the usual case when there's not any interrupt, it's inefficient to start all over again when we already know what's going on! In other words, when another iteration begins it would be better to restart from Cycle 3 -- or even later -- rather than re-fetching the instruction Prefix, the Opcode, etc etc etc. (But if there is an interrupt, then you do roll PC all the way back before allowing it to get pushed by the interrupt, and after the ISR the block move recommences from Cycle 1.)

Also, I'm dithering over whether it might be better if the instruction Operand were to indicate an offset into the stack (rather than a z-pg address). The stack would probably be more convenient to code for. But adding the Operand to S will probably cost a cycle, and that kinda hurts performance unless that extra cycle is excluded from the restart loop. No doubt it can be excluded, but I expect there'd be a cost in resources. Always tradeoffs... :roll:

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Sat Jun 26, 2021 12:12 pm, edited 1 time in total.

 Post subject: Re: Extended 65CE02 Core
PostPosted: Sat Jun 26, 2021 11:56 am 
Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
Dr Jefyll wrote:
It's slower than DMA, but one approach would be to create an instruction that moves just one byte, and can automatically execute repeatedly (kind of like what the '816 does with MVP and MVN).

Wait, Slower than DMA?! well i can't allow something like that in an FPGA Core all about low cycle times and speed, now can i? :wink:

Dr Jefyll wrote:
In the following example, the source and destination addresses are held in a combination of registers and zero-page. The usage is unusual. X and Y hold the least significant bytes of the addresses -- the address "low" or ADL bytes. The two ADH ("high") bytes are held in two consecutive bytes in z-pg. Each iteration would go like this:

  • Cycle 1: read the instruction Prefix.
  • Cycle 2: read the instruction Opcode
  • Cycle 3: read the instruction operand. It points to the z-pg location of the two bytes. Let's say the operand is $42.
  • Cycle 4: read location $42. This gives us the ADH of the Source address.
  • Cycle 5: read a byte to be moved. The address is formed by ($42) concatenated with X. Increment X.
  • Optional Cycle 5a: if X overflowed (ie; is now =0), increment the ADH value and write it back to $42.
  • Cycle 6: read location $43. This gives us the ADH of the Destination address.
  • Cycle 7: write the byte to be moved. The address is formed by ($43) concatenated with Y. Increment Y. Decrement Z (which holds the count). If Z <>0, don't advance PC. Instead, roll it back so it points to the Prefix again.
  • Optional Cycle 7a: if Y overflowed (ie; is now =0), increment the ADH value and write it back to $43.

I see what you mean: each Operation is its own instruction and it just stays on its own Opcode until Z == 0.
but like you said later on, this also introduces inefficiencies, as the Opcode and Operands have to be read in every single Operation.
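
To make the per-iteration cost visible, the cycle list quoted above can be modelled roughly as follows. This is only a sketch: the dictionary of registers, the function names, and the flat 64K `bytearray` memory are assumptions for illustration, and it ignores interrupts entirely.

```python
def block_move_iteration(mem, regs, zp):
    """Model one pass through Cycles 1-7; return the cycles consumed."""
    cycles = 7                                   # Cycles 1-7 always happen
    src = (mem[zp] << 8) | regs["X"]             # Cycle 4: source ADH from z-pg
    byte = mem[src]                              # Cycle 5: read the byte
    regs["X"] = (regs["X"] + 1) & 0xFF
    if regs["X"] == 0:                           # Cycle 5a: carry into source ADH
        mem[zp] = (mem[zp] + 1) & 0xFF
        cycles += 1
    dst = (mem[zp + 1] << 8) | regs["Y"]         # Cycle 6: destination ADH
    mem[dst] = byte                              # Cycle 7: write the byte
    regs["Y"] = (regs["Y"] + 1) & 0xFF
    regs["Z"] = (regs["Z"] - 1) & 0xFF
    if regs["Y"] == 0:                           # Cycle 7a: carry into destination ADH
        mem[zp + 1] = (mem[zp + 1] + 1) & 0xFF
        cycles += 1
    return cycles


def block_move(mem, regs, zp):
    """Repeat the instruction until Z reaches 0; return total cycles."""
    total = 0
    while True:
        total += block_move_iteration(mem, regs, zp)
        if regs["Z"] == 0:                       # PC finally advances
            return total
```

In this model three of the seven base cycles (prefix, opcode, operand) are spent re-fetching the instruction on every byte, which is the inefficiency mentioned above.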

my idea was to have a separate DMAC inside the CPU that could be activated on command and would pause itself (or completely stop) if an Interrupt or Abort is caught.
The plus side is that it requires fewer new Microinstructions and is much faster, downside is that it does add a lot of new registers and logic.
something like this:
Code:
CPU:                                            State Machine (SM):
0. Load Prefix; End of Instruction              0. Nothing
0. Load Opcode                                  0. Nothing
1. Load Operand byte into the TBL Register      1. SM detects that its Opcode is loaded; Was the SM Interrupted Last time (BUSY == 1)? if yes Jump to Cycle 8
2. NOP                                          2. if not, Load Operand "BytesLow" from the Stack at Address SP+TBL+0
3. Nothing                                      3. Load Operand "BytesHigh" from the Stack at Address SP+TBL+1
4. Nothing                                      4. Load Operand "SourceLow" from the Stack at Address SP+TBL+2
5. Stay on Cycle 5 until Interrupt is received  5. Load Operand "SourceHigh" from the Stack at Address SP+TBL+3
5. Stay on Cycle 5 until Interrupt is received  6. Load Operand "DestinationLow" from the Stack at Address SP+TBL+4
5. Stay on Cycle 5 until Interrupt is received  7. Load Operand "DestinationHigh" from the Stack at Address SP+TBL+5; Set BUSY
5. Stay on Cycle 5 until Interrupt is received  8. Load into TEMP Register from Memory at Address SOURCE + BYTES
5. Stay on Cycle 5 until Interrupt is received  9. Store TEMP Register into Memory at Address DEST + BYTES; if BYTES <> 0 Jump to Cycle 8 and Decrement BYTES (Clear BUSY if BYTES == 0)

Repeat until BYTES == 0

5. Stay on Cycle 5 until Interrupt is received  10. End the SM, send a fake Interrupt to the Control Unit
6. End of Instruction; Load next Opcode         11. Nothing

(note that this might change if i go for a 24 bit Address Bus, requiring 2 extra cycles before the DMAC starts its loop for the 2 extra Address Bytes)

as you can see the actual Moving is a simple 2 cycle loop: 1 cycle to load a byte, and 1 to write it back and Advance to the next Address. (which adds up to 500,000 bytes/s, or roughly 0.48 MiB/s, per MHz)
there is also the "BUSY" FlipFlop that gets set at the SM's Cycle 7 (meaning that it will be 1 at the start of Cycle 8) and gets cleared at Cycle 9 (only when the loop ends).
the purpose of the BUSY FlipFlop is to keep track of whether the SM has already started its loop or not.
so when an Interrupt occurs before BUSY is set (ie while it's loading Operands) then the next time the Instruction starts it will start from the very beginning.
but if an Interrupt occurs after BUSY is set then BUSY will remain set, so the next time the Instruction starts it will skip the Operand fetching and go straight back to the Loop.
the only exception (like usual) is Abort: when an Abort occurs the SM is forced to set the BUSY FlipFlop to 0

I think this should cover every possible Interrupt case
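
A toy model of the BUSY behaviour described above might look like the sketch below. The class and method names are hypothetical, `base` stands in for SP+TBL, the operand layout follows the state-machine table (BytesLow at +0 through DestinationHigh at +5), and it assumes for simplicity that an interrupt only takes effect between complete load/store pairs.

```python
class Dmac:
    """Toy model of the block-move state machine and its BUSY flip-flop."""

    def __init__(self):
        self.busy = False
        self.count = self.source = self.dest = 0

    def run(self, mem, base, budget):
        """Execute the instruction for at most `budget` SM cycles.

        Returns True if the move finished, False if an interrupt cut it short.
        """
        if not self.busy:
            if budget < 6:
                return False                     # interrupted mid-fetch: BUSY stays 0
            budget -= 6                          # SM cycles 2-7: six operand bytes
            self.count = mem[base] | (mem[base + 1] << 8)
            self.source = mem[base + 2] | (mem[base + 3] << 8)
            self.dest = mem[base + 4] | (mem[base + 5] << 8)
            self.busy = True
        while budget >= 2:                       # SM cycles 8-9: load, then store
            budget -= 2
            offset = self.count
            mem[(self.dest + offset) & 0xFFFF] = mem[(self.source + offset) & 0xFFFF]
            if self.count == 0:
                self.busy = False                # loop ended normally
                return True
            self.count -= 1
        return False                             # interrupted with BUSY still set

    def abort(self):
        """An Abort forces BUSY to 0, so the next start refetches the operands."""
        self.busy = False
```

Calling `run` again after an interruption resumes from the saved count without touching the stack operands, which is what the BUSY flip-flop is meant to achieve.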

Dr Jefyll wrote:
Also, I'm dithering over whether it might be better if the instruction Operand were to indicate an offset into the stack (rather than a z-pg address). The stack would probably be more convenient to code for. But adding the Operand to S will probably cost a cycle, and that kinda hurts performance unless that extra cycle is excluded from the restart loop. No doubt it can be excluded, but I expect there'd be a cost in resources. Always tradeoffs... :roll:

Hmm, the stack seems to be a convenient place to load Operands from, though i cannot modify the value of the SP while reading Operands: if an Interrupt occurs right when the SM loads its last few operands, the Interrupt sequence would overwrite the previously loaded bytes on the stack.
the order of bytes on the Stack could be: (top = lower address, bottom = higher address)
  • Amount to Move Low byte
  • Amount to Move High byte
  • Destination Address Low byte
  • Destination Address High byte
  • Source Address Low byte
  • Source Address High byte

Dr Jefyll wrote:
There are many, many ways this could work. You might want to have a look at my 1988 KK Computer, which uses a scheme that's fairly simple and general. There are four bank registers, called K0-K3. K1 to K3 can be invoked by the programmer at will. In other words, they're not tied to any specific function, such as Vector Bank Register. And K0 is the default -- it's what you get when you don't ask to invoke K1-K3.

I opted to keep zero-page and the stack in Bank 0 at all times, which means K0-K3 are ignored for these accesses. I'm not bothered by keeping stack and z-pg in Bank 0. The important thing (for me) is being able to address data arrays that exceed 64K. (The same 24-bit space can also contain code, but for me the need for code space is secondary compared to the need for data space.)

hmm, Interesting idea to have multiple almost orthogonal Bank Registers, but unlike the 65C02 i don't have any spare/blank Opcodes that i could use as additional Prefix Bytes. while i do have 204 unused Opcodes in my extended Opcode Table, i don't like stacking Prefix bytes as the instructions would just get ridiculously long, and I don't think having multiple copies of the same instruction that each just use a different DBR will be worth the extra opcodes it would take up. So i don't think that idea is right for this Project.
but i could at least add a few Load Instructions specifically for the DBR (and the other Bank Register), so switching between different Banks is less of a hassle.
for the actual Addressing modes i was thinking of these:
  • Absolute Long
  • Absolute Long X-Indexed
  • Absolute Long Y-Indexed
  • Base Page Indirect Long Z-Indexed

And the instructions that make use of those would be: LDA, STA, ADC, SBC, AND, ORA, XOR, CMP, INC, DEC, ICC, DCC, ASL, ASR, LSR, ROL, ROR
note that INC, DEC, ICC, DCC, ASL, ASR, LSR, ROL, and ROR only use the first 3 Addressing modes. (which is still more than the 65816 allows)
in total this would add up to 43 new Instructions plus any miscellaneous Instructions like Long (and maybe Indirect Long) versions of JMP/JSR/RTS, Push/Pull, Special Loads/Stores, etc

Overall i'm really warming up to the idea of extra Addressing Space, with the way i have it in mind only the new Instructions i were to add would actually interact/modify any of the new registers, all of the existing Instructions would just work like normal without requiring some rewrite/rewiring. I'll make a copy of the Logisim Version and see if i can modify the circuit to allow for Bank Registers and all that.
though i'm still not sure if an Emulation mode is necessary, the only difference it would have to Native mode is how many bytes Interrupts and BRK push onto the stack, and how many RTI pulls from the stack.
so i'm kinda thinking about saying "screw it" and not implementing an Emulation Mode at all and just hope there aren't many programs that use Interrupts, BRK, and RTI in some weird way that only works with 16 bit addresses.


 Post subject: Re: Extended 65CE02 Core
PostPosted: Sat Jun 26, 2021 3:19 pm 
Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1928
Location: Sacramento, CA, USA
Proxy wrote:
I think this should cover every possible Interrupt case.

... as long as you don't have any block copies inside your ISR, right? (just messin' with ya ...) :D

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


 Post subject: Re: Extended 65CE02 Core
PostPosted: Sat Jun 26, 2021 5:37 pm 
Joined: Thu May 28, 2009 9:46 pm
Posts: 8175
Location: Midwestern USA
barrym95838 wrote:
Proxy wrote:
I think this should cover every possible Interrupt case.

... as long as you don't have any block copies inside your ISR, right? (just messin' with ya ...) :D

With the 65C816, using MVN and MVP in an ISR is certainly possible, as long as the ISR saves and restores the MPU's complete state.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


 Post subject: Re: Extended 65CE02 Core
PostPosted: Sat Jun 26, 2021 6:35 pm 
Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
BigDumbDinosaur wrote:
barrym95838 wrote:
Proxy wrote:
I think this should cover every possible Interrupt case.

... as long as you don't have any block copies inside your ISR, right? (just messin' with ya ...) :D

With the 65C816, using MVN and MVP in an ISR is certainly possible, as long as the ISR saves and restores the MPU's complete state.

well it's technically possible if you have 2 MOV Instructions one after another. the first one just moves a single byte to exactly where it was before, and the second one is the actual Block Move of your ISR.
so when a MOV was interrupted during its loop, the first MOV it will encounter in the ISR makes it continue that loop until it's done, and then move on to the next MOV which is the one you actually want to run. (downside is the Return address will still point towards the MOV that was interrupted so that same MOV will run a second time, which can be very slow)
and in case a MOV was not interrupted the first MOV will waste like 10 cycles copying 1 byte before starting the actual MOV.
obviously neither case is ideal, but i hope the overall transfer speeds make up for the... inconvenience


 Post subject: Re: Extended 65CE02 Core
PostPosted: Mon Jun 28, 2021 5:25 pm 
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3353
Location: Ontario, Canada
Proxy wrote:
Wait, Slower than DMA?! well i can't allow something like that in an FPGA Core all about low cycle times and speed, now can i? :wink:
OK, I realize you're being funny. But aren't the low cycle times you mention directly in conflict with the added complexity of DMA (due to on-chip routing issues)? In other words, I think adding DMA (which of course is faster) will to some extent make the entire Core run more slowly.

I'm not saying don't do it. I'm just saying, try to keep things in perspective. Is block-move speed really a critical issue, or are you adding DMA just because you can?

BTW, it seems to me the proposed DMA (or perhaps only your description of it) needs some improvement. Cycle 9 of the SM says, "Jump to Cycle 8 and Decrement BYTES (Clear BUSY if BYTES == 0)." But if it keeps going until BYTES == 0 then that means the DMA can't get interrupted.

Also, Mike B was hinting at reentrancy issues. Can you walk us through this? I get that your DMA parameters would be stored on stack, so that's good. If a block-move gets interrupted and the ISR also uses a block-move then that would involve allocation of a new block of storage on stack. Then, after the RTI, we would re-fetch the original prefix and block-move opcode. But... only if the interrupted block-move didn't complete? Seems like it would have to push a different Return Address according to whether or not it completed.

Quote:
the actual Addressing modes i was thinking of these:
  • Absolute Long
  • Absolute Long X-Indexed
  • Absolute Long Y-Indexed
  • Base Page Indirect Long Z-Indexed

And the instructions that make use of those would be: LDA, STA, ADC, SBC, AND, ORA, XOR, CMP, INC, DEC, ICC, DCC, ASL, ASR, LSR, ROL, ROR
note that INC, DEC, ICC, DCC, ASL, ASR, LSR, ROL, and ROR only use the first 3 Addressing modes. (which is still more than the 65816 allows)
As you know, using a single prefix you'll only get 256 new opcodes. I suggest you spend these on providing plenty of long address modes, even though you'll have fewer instructions that can use those modes. LDA and STA are important, and will get used frequently, but IMO there's no need for dozens of instructions to have long address modes.

In particular, I suggest adding more long indirect modes, including a non-indexed version, a version that indexes using Y, and a version that indexes before the indirection -- ie, x[ind] mode.

Indirect address modes are central to 65xx, simply because 65xx has no full-width pointer registers on-chip. It's indirect pointers that grant you full freedom to compute addresses at runtime! So, I suggest you make the most of that when you expand to a 24-bit address space.


-- Jeff



 Post subject: Re: Extended 65CE02 Core
PostPosted: Mon Jun 28, 2021 11:52 pm 
Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
Dr Jefyll wrote:
OK, I realize you're being funny. But aren't the low cycle times you mention directly in conflict with the added complexity of DMA (due to on-chip routing issues)? In other words, I think adding DMA (which of course is faster) will to some extent make the entire Core run more slowly.

I'm not saying don't do it. I'm just saying, try to keep things in perspective. Is block-move speed really a critical issue, or are you adding DMA just because you can?

I'm not an expert when it comes to hardware design, so i didn't think that taking an external DMA Controller and just taping it to the CPU would cause the entire core to be slower. but it does make sense as the Clock's Path through the DMA Circuit could take longer than through the CPU, dragging Fmax for the whole circuit down with it...
and yes i just thought to add it because the 65816 also has a block-move.

Dr Jefyll wrote:
BTW, it seems to me the proposed DMA (or perhaps only your description of it) needs some improvement. Cycle 9 of the SM says, "Jump to Cycle 8 and Decrement BYTES (Clear BUSY if BYTES == 0)." But if it keeps going until BYTES == 0 then that means the DMA can't get interrupted.

the description i posted might've been a bit confusing, as there are basically 2 separate state machines running in parallel (the DMAC and the Control Unit). and just to clarify, the DMAC is not capable of preventing Interrupts from going to the Control Unit. Any Interrupt will cause the Instruction to end (as it's basically just a modified WAI Instruction), which in turn forces the DMAC to stop.
the BUSY Flag is just there to save the state that the DMAC is currently in, for when it gets continued after an interrupt.
ie if BUSY is 0 the Block Move Instruction starts from the very beginning and fetches the operands from the stack, and when it's 1 the Block Move skips the Operand Fetches and immediately continues copying bytes.
though the Block Move can also be interrupted AFTER it's done... basically after BYTES == 0, the last byte has been written and BUSY has been cleared again, there is still 1 cycle left where the DMAC sends a fake interrupt to the Control Unit to end it's loop, but if a real interrupt occurs it would overwrite the fake one, and after returning it would restart the whole instruction....
which is not good.

Dr Jefyll wrote:
Also, Mike B was hinting at reentrancy issues. Can you walk us through this? I get that your DMA parameters would be stored on stack, so that's good. If a block-move gets interrupted and the ISR also uses a block-move then that would involve allocation of a new block of storage on stack. Then, after the RTI, we would re-fetch the original prefix and block-move opcode. But... only if the interrupted block-move didn't complete? Seems like it would have to push a different Return Address according to whether or not it completed.

yes, if an Interrupt happens anytime during the execution of the Block Move instruction, the return address that gets pushed onto the stack will always be the Address of its own Opcode.
and sadly, as said above, this is also the case if the DMAC just finished copying all bytes and was just about to end the instruction.

overall i think i should drop the whole interruptible State Machine and just accept some slower but much easier to implement alternative that works similarly to MVN and MVP.

I'll just take your example you gave a while ago and modify it a bit, it limits the amount of bytes you can move to 256 and the whole instruction takes a total of 8 cycles just to copy 1 byte (~122kB/s per MHz).
it's slower than MVN/MVP because there aren't a lot of internal registers the programmer has access to like with the 65816.
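
The per-MHz figures quoted in this thread follow directly from the cycles spent per byte. A quick check, assuming binary units (1 KiB = 1024 bytes) and ignoring setup cycles and interrupts:

```python
def bytes_per_second(clock_hz, cycles_per_byte):
    """Sustained copy rate, ignoring setup cycles and interrupts."""
    return clock_hz / cycles_per_byte


dma_rate = bytes_per_second(1_000_000, 2)   # 2-cycle DMAC loop
mov_rate = bytes_per_second(1_000_000, 8)   # 8-cycle MOV-style instruction

print(f"DMAC loop: {dma_rate / 2**20:.3f} MiB/s per MHz")
print(f"MOV:       {mov_rate / 2**10:.1f} KiB/s per MHz")
```

So the simpler instruction gives up a factor of four in raw throughput compared with the 2-cycle loop.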

the Stack contains 2 16-bit Addresses with the Source address being on the top (note that the stack pointer is not modified, otherwise the Instruction wouldn't repeat properly)
the X and Y Registers contain the Source and Destination Data bank values.
and the Z Register has the amount of bytes to move minus one in it. (value of 0 = 1 byte to move, value of 255 = 256 bytes to move)
Code:
1.  Fetch Opcode (Prefix); END Instruction
2.  Fetch Opcode; If the Z Register is equal to 0, set PC to the Address of the next Opcode, otherwise don't
3.  Load byte into the TAL Register from Address SP + 0
4.  Load byte into the TAH Register from Address SP + 1
5.  Load byte into the A Register from Address X:TA + Z
6.  Load byte into the TAL Register from Address SP + 2
7.  Load byte into the TAH Register from Address SP + 3
8.  Store byte into Memory at Address Y:TA + Z from the A Register; Decrement Z; END Instruction

this should still be a bit faster than using a software routine.
EDIT: it's hard to tell where the operand byte is relative to the PC. as the PC could either be on the current or on the next opcode, so instead of dealing with that i just removed the operand which does make it slightly faster at the cost of some flexibility, since the addresses now have to be on the top of the stack.
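
The 8-step sequence above can be modelled roughly like this. It is a sketch under stated assumptions: memory is a sparse dict keyed by 24-bit address, `sp` is taken literally as the address of the first parameter byte ("SP + 0"), and the `TA + Z` sum is assumed to wrap within the 64K bank, which the post does not specify.

```python
def mov_pass(mem, sp, regs):
    """One 8-cycle pass; returns True if this was the final pass."""
    final = regs["Z"] == 0                                # step 2: will PC advance?
    src = mem.get(sp, 0) | (mem.get(sp + 1, 0) << 8)      # steps 3-4: TA = source
    src_addr = (regs["X"] << 16) | ((src + regs["Z"]) & 0xFFFF)
    regs["A"] = mem.get(src_addr, 0)                      # step 5: load via X:TA + Z
    dst = mem.get(sp + 2, 0) | (mem.get(sp + 3, 0) << 8)  # steps 6-7: TA = destination
    dst_addr = (regs["Y"] << 16) | ((dst + regs["Z"]) & 0xFFFF)
    mem[dst_addr] = regs["A"]                             # step 8: store via Y:TA + Z
    regs["Z"] = (regs["Z"] - 1) & 0xFF
    return final


def mov(mem, sp, regs):
    """Run passes until the final one; return the total cycle count."""
    passes = 1
    while not mov_pass(mem, sp, regs):
        passes += 1
    return passes * 8
```

Note that in this model the bytes are copied from the highest offset down to offset 0, and A is clobbered by the move.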

Dr Jefyll wrote:
As you know, using a single prefix you'll only get 256 new opcodes. I suggest you spend these on providing plenty of long address modes, even though you'll have fewer instructions that can use those modes. LDA and STA are important, and will get used frequently, but IMO there's no need for dozens of instructions to have long address modes.

In particular, I suggest adding more long indirect modes, including a non-indexed version, a version that indexes using Y, and a version that indexes before the indirection -- ie, x[ind] mode.

Indirect address modes are central to 65xx, simply because 65xx has no full-width pointer registers on-chip. It's indirect pointers that grant you full freedom to compute addresses at runtime! So, I suggest you make the most of that when you expand to a 24-bit address space.


yea that makes sense, how about these?
  • Absolute Long ($000000)
  • Absolute Long X-Indexed ($000000,X)
  • Absolute Long Y-Indexed ($000000,Y)
  • Base Page X-Indexed Indirect Long (<[$00,X])
  • Base Page Indirect Long Y-Indexed (<[$00],Y)
  • Base Page Indirect Long (<[$00]) (could also use the Z Register as Index like the Base Instruction Set)
  • Base Page Indirect Long Double-Indexed (<[$00,X],Y)
  • Absolute X-Indexed Indirect Long ([$0000,X])
  • Stack Indirect Long Y-Indexed ([$00,SP],Y)
so 6 out of 9 addressing modes are Indirect, which i'd mostly use for LDA and STA (Absolute X-Indexed Indirect Long might be limited to some sort of JMP Instruction like with the 65816)
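
For two of the indirect modes listed above, the effective-address calculation could be sketched as below. Everything here is an assumption for illustration: pointers are taken to be 3 bytes, little-endian; `bp_base` is the address of the base page; X-indexing wraps within the base page; and the Y index is added to the full 24-bit pointer without wrapping at a bank boundary.

```python
def read_ptr24(mem, addr):
    """Read a 3-byte little-endian pointer starting at `addr`."""
    return mem[addr] | (mem[addr + 1] << 8) | (mem[addr + 2] << 16)


def ea_bp_indirect_long_y(mem, bp_base, operand, y):
    """Base Page Indirect Long Y-Indexed: fetch the pointer, then add Y."""
    ptr = read_ptr24(mem, bp_base + operand)
    return (ptr + y) & 0xFFFFFF


def ea_bp_x_indexed_indirect_long(mem, bp_base, operand, x):
    """Base Page X-Indexed Indirect Long: add X first, then fetch the pointer."""
    slot = bp_base + ((operand + x) & 0xFF)
    return read_ptr24(mem, slot)
```

The first form suits walking through a large array from a fixed pointer; the second suits picking one pointer out of a table.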

though i feel like some Arithmetic and Logic Instructions could still benefit from having some Long Addressing modes, specifically ADC, SBC, AND, ORA, XOR, and CMP, which could make use of the first 5 Addressing modes i listed above, like they do in the Base Instruction set. having those could likely help with Position Independent Data Structures.


 Post subject: Re: Extended 65CE02 Core
PostPosted: Fri Jul 09, 2021 6:36 pm 
Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
alright, small in-between update.
Feature creep kinda got me in its fangs for some time.

basically i changed the Address ALU and added the extra Registers required for 24 bit Addressing, but i also wanted to have an extra Flag that would allow the PBR to change with the PC whenever values are added to the PC (like through relative jumps or normal program flow). this flag is called "w" for "disable PC wrapping".
problem was that i needed to put the flag somewhere, but i don't like the 65816's way of stacking flags on top of each other, so instead i just added a new Status Register which is internally called "FL1" for "Flags 1".
but just having a single bit in there seemed a bit lonely, so i also implemented priority Interrupts (IRQ0 (highest) - IRQ7 (lowest)) and also a new flag that switches between all IRQs (and BRK) sharing the same Interrupt Vector (for backwards compatibility), and each of them having their own separate Vector.
And now i'm currently in the process of reordering all extended Instructions to better line up with the base instruction set.
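
The effect of the "w" flag on program-counter arithmetic can be sketched as follows. The function name and argument order are hypothetical; the only behaviour taken from the post is that with "w" set a carry or borrow out of the 16-bit PC propagates into the PBR, and with it clear the PC wraps inside its bank.

```python
def advance_pc(pbr, pc, delta, w):
    """Add a signed `delta` to the PC; return the new (pbr, pc) pair."""
    if w:
        # Wrapping disabled: treat PBR:PC as one 24-bit counter.
        full = (((pbr << 16) | pc) + delta) & 0xFFFFFF
        return full >> 16, full & 0xFFFF
    # Wrapping enabled: PC stays inside the current bank.
    return pbr, (pc + delta) & 0xFFFF
```

For example, stepping 4 bytes forward from $01:FFFE lands at $01:0002 with the flag clear and at $02:0002 with it set.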

my unused Opcode count went from 204 to 125 (79 new Instructions), though that could still increase by 6 since i recently saw that the 65816 has the Stack Indirect addressing mode for ADC, SBC, AND, ORA, XOR, and CMP... though i fear that might just be feature creep whispering into my ear again, telling me i need more instructions


 Post subject: Re: Extended 65CE02 Core
PostPosted: Sun Jul 11, 2021 1:53 pm 
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3353
Location: Ontario, Canada
Proxy wrote:
Feature creep kinda got me in its fangs for some time.
No! D' ya think? :wink:

Well, that's alright -- it's a hobby, and you can do whatever you want. I, too, have been known to get carried away adding features to a project. :roll: :oops:

Adding a second Flags register sounds somewhat complicated, though. Does it get pushed to stack when an interrupt occurs, then later recovered when the RTI executes? That will slow the CPU's interrupt response, and one has to ask, is it worth it? Presumably you have already weighed the tradeoff. The decision will depend on your own goals and circumstances.

Quote:
my unused Opcode count went from 204 to 125 (79 new Instructions), though that could still increase by 6 since i recently saw that the 65816 has the Stack Indirect addressing mode for ADC, SBC, AND, ORA, XOR, and CMP...
There are eight altogether -- you've omitted LDA and STA.

Is the Stack Indirect addressing mode a case of feature creep? I wouldn't say so, but that's just a reflection of my own goals and priorities.

Indeed, if it were my project I'd be tempted to go one step further and add Long Stack Indirect addressing mode!

-- Jeff



 Post subject: Re: Extended 65CE02 Core
PostPosted: Sun Jul 11, 2021 2:22 pm 
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
proxy:

I wholeheartedly concur with Dr Jefyll's statement:
Dr. Jefyll wrote:
Well, that's alright -- it's a hobby, and you can do whatever you want. I, too, have been known to get carried away adding features to a project. :roll: :oops:
I have been happily working on my extended 6502/65C02 soft-core for more years than I'd care to admit. Although I have long ago concluded that the accumulator / memory architecture of the 6502/65C02 is not optimal for high performance, through pipelining, for example, I try to maintain the basic architectural feel of these processors in any instruction set architecture (ISA) extensions that I add and experiment with.

If I throw in my $0.02 worth of advice, I would say to tailor your extensions to support HLLs, and one of those extensions would be to support the stack-relative addressing mode in both short and long forms, as Dr Jefyll suggests. I would also consider extending the pre-indexed addressing modes, both direct and indirect versions, to support short and long addressing.

The stack-relative addressing mode that I added to my core, inspired in large part by its availability in the '816 processor, has proven to be invaluable in my implementation of a Pascal compiler for my core. The other, base-relative or bp-relative, is based on the pre-indexed addressing mode of the 6502/65C02 ISA. It has also proven invaluable for accessing nested variables stored on the stack, as is supported by Pascal.

In the definition of my core, I implemented some of Dr Jefyll's suggestions regarding the Forth VM's registers. In doing so, I have a very low overhead support for indirect / direct threaded code (ITC / DTC) FORTH. I thought that to support that language I should also provide a number of instructions (16) supporting an IP (with autoincrement) addressing mode. Over time I've concluded that all but the LDA ip,I++ and STA ip,I++ instructions are not useful to an HLL program, and even then the autoincrement attribute is not desirable with my Pascal compiler. So I'm in the process of refactoring my ISA to reuse 14 of those instructions and removing the autoincrement attribute of the register indirect addressing mode that my current I++ addressing mode represents.

I enjoy reading your thread, so please continue the work. It's a great hobby project.

_________________
Michael A.


 Post subject: Re: Extended 65CE02 Core
PostPosted: Sun Jul 11, 2021 5:50 pm 
Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
Dr Jefyll wrote:
Proxy wrote:
Feature creep kinda got me in its fangs for some time.
No! D' ya think? :wink:

Well, that's alright -- it's a hobby, and you can do whatever you want. I, too, have been known to get carried away adding features to a project. :roll: :oops:

Adding a second Flags register sounds somewhat complicated, though. Does it get pushed to stack when an interrupt occurs, then later recovered when the RTI executes? That will slow the CPU's interrupt response, and one has to ask, is it worth it? Presumably you have already weighed the tradeoff. The decision will depend on your own goals and circumstances.

the second Flags Register can only be affected by its Transfer Instructions (TAF and TFA), and IRQs (bits 1-3 change to the priority of the last IRQ that occurred). i thought that this Register would be rarely changed (at most it's only read out in an ISR to see what the IRQ's priority was) so it would make little sense to put it on the stack during an Interrupt.
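
As a small illustration of the "bits 1-3 hold the last IRQ priority" idea, an ISR-side decode might look like this. The helper names are hypothetical, and the meaning of the other FL1 bits is not assumed; they are simply preserved.

```python
PRIORITY_MASK = 0b0000_1110   # bits 1-3 of FL1


def record_irq_priority(fl1, irq_level):
    """Return FL1 with bits 1-3 replaced by the priority of the IRQ just taken."""
    if not 0 <= irq_level <= 7:
        raise ValueError("IRQ level must be 0 (highest) to 7 (lowest)")
    return (fl1 & ~PRIORITY_MASK & 0xFF) | (irq_level << 1)


def last_irq_priority(fl1):
    """What an ISR would compute after a TFA to find out which IRQ fired."""
    return (fl1 & PRIORITY_MASK) >> 1
```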

Dr Jefyll wrote:
Quote:
my unused Opcode count went from 204 to 125 (79 new Instructions), though that could still increase by 6 since i recently saw that the 65816 has the Stack Indirect addressing mode for ADC, SBC, AND, ORA, XOR, and CMP...
There are eight altogether -- you've omitted LDA and STA.

Is the Stack Indirect addressing mode a case of feature creep? I wouldn't say so, but that's just a reflection of my own goals and priorities.

i didn't mention LDA and STA because the base 65CE02 already adds those. so i would only need to add that addressing mode for the remaining arithmetic/logic instructions.
and they do seem really really useful for using pointers in functions without having to grab them off the stack and putting them on the base page.
hmmm...
Dr Jefyll wrote:
Indeed, if it were my project I'd be tempted to go one step further and add Long Stack Indirect addressing mode!

the funny thing is, i already did. though it's limited to LDA and STA.
I really want to add these Addressing modes to ADC, SBC, etc but that would be an extra 12 Opcodes.
I wonder... what if i were to only add the Long (ie 24 bit) Stack Indirect Addressing mode for ADC, SBC, etc. and just leave out the 16 bit version entirely? (except for LDA and STA of course since they are in the Base Instruction Set)
it would save me 6 Opcodes at the cost of an extra byte on the Stack for every pointer and slightly slower cycle times.
also one noticeable difference would be that Long Stack Indirect Addressing doesn't wrap at 64k boundaries, while Regular Stack Indirect Addressing does.
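To make the wrapping difference concrete, here is a small Python sketch of the two effective-address calculations. It assumes both modes add Y to the pointer fetched from the stack (as the 65CE02 and 65816 stack-indirect modes do) and that the short form takes its bank from a Data Bank register; the function names are made up for illustration.

```python
def short_stack_indirect_ea(data_bank, ptr16, y):
    # 16-bit pointer from the stack: the sum wraps inside the current 64k bank.
    return ((data_bank & 0xFF) << 16) | ((ptr16 + y) & 0xFFFF)


def long_stack_indirect_ea(ptr24, y):
    # 24-bit pointer from the stack: the sum carries into the next bank.
    return (ptr24 + y) & 0xFFFFFF
```

With a pointer of $FFF0 in bank $01 and Y = $20, the short form lands on $010010 while the long form lands on $020010.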

MichaelM wrote:
I wholeheartedly concur with Dr Jefyll's statement. I have been happily working on my extended 6502/65C02 soft-core for more years than I'd care to admit. Although I have long ago concluded that the accumulator / memory architecture of the 6502/65C02 is not optimal for high performance, through pipelining, for example, I try to maintain the basic architectural feel of these processors in any instruction set architecture (ISA) extensions that I add and experiment with.

I fear that i might work on this core for the next 10 years and never build an actual Computer with it.
i'd like to say i also try to keep the general "feel" of the base CPU when extending it, since i'm mostly looking at the 65816 and this forum for rough ideas of what is useful to add. but i'm not 100% sure if i'm successful. i guess i can only really tell once i start writing actual programs for it.
MichaelM wrote:
If I throw in my $0.02 worth of advice, I would say to tailor your extensions to support HLLs, and one of those extensions would be to support the stack-relative addressing mode in both short and long forms, as Dr Jefyll suggests. I would also consider extending the pre-indexed addressing modes, both direct and indirect versions, to support short and long addressing.

HLL are not really something i know a lot about right now, so it's hard for me to tell what is still "missing" from the current set of Instructions/Features to allow some HLL to generate decent machine code.
I would like to learn about Compilers and such so i could maybe add this CPU as a new Target for some existing C Compiler like GCC or LLVM. (or maybe even CC65?) but that will probably take a while...
also I'm a bit confused, why have both long (24 bit) and short (16 bit) Stack Indirect Addressing modes? i thought just having the long version would be enough since it can function almost identically to the short version if you just push the Data Bank onto the stack before pushing the actual Pointer.
is it just to save stack space and make the code a bit faster/smaller?
also what exactly do you mean with "pre-indexed"? and so far i've added long versions of all Addressing modes that have any kind of Absolute Address (16 bit) in them (Direct and Indirect). (only kinda exception being Stack Indirect, of which i have only added the 24 bit version with the 16 bit ones being limited to LDA and STA)
MichaelM wrote:
The stack-relative addressing mode that I added to my core, inspired in large part by its availability in the '816 processor, has proven to be invaluable in my implementation of a Pascal compiler for my core. The other, base-relative or bp-relative, is based on the pre-indexed addressing mode of the 6502/65C02 ISA. It has also proven invaluable in making the access to nested variables stored on the stack, as is supported by Pascal.

In the definition of my core, I implemented some of Dr Jefyll's suggestions regarding the Forth VM's registers. In doing so, I have a very low overhead support for indirect / direct threaded code (ITC / DTC) FORTH. I thought that to support that language I should also provide a number of instructions (16) supporting an IP (with autoincrement) addressing mode. Over time I've concluded that all but the LDA ip,I++ and STA ip,I++ instructions are not useful to an HLL program, and even then the autoincrement attribute is not desirable with my Pascal compiler. So I'm in the process of refactoring my ISA to reuse 14 of those instructions and removing the autoincrement attribute of the register indirect addressing mode that my current I++ addressing mode represents.

I enjoy reading your thread, so please continue the work. It's a great hobby project.

again i don't really know the inner workings of any HLL, so most of the additions i've made are from the perspective of writing in Assembly and i'm just kinda hoping that HLLs are gonna be fine with what i have right now. even if not i can always add or adjust instructions.

anyways, thanks for the kind words at the end. the project is pretty difficult and stressful at times, but overall it's fun to work out all the issues and features.

I have uploaded an Updated Opcode table with all Addressing Modes on the side.
none of the new Instructions are explained in the image, but most of them are just existing instructions with new Addressing Modes or are just taken from the 65816 and function pretty much identically.


Attachments:
2021-07-11_19-38-50.png [ 847.7 KiB | Viewed 6546 times ]
 Post subject: Re: Extended 65CE02 Core
PostPosted: Sun Jul 11, 2021 7:27 pm 

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
proxy wrote:
HLL are not really something i know a lot about right now, so it's hard for me to tell what is still "missing" from the current set of Instructions/Features to allow some HLL to generate decent machine code.
I don't think that you have to be an expert in HLLs, or even in how HLLs generate machine code. You could simply examine the output from a number of compilers, which for small systems, can generally intermix the HLL constructs with the assembly / machine language output source. In this manner you can see that a large number of the 6502/65C02 ISA instructions or addressing modes are never used by the code generators. Those unused instructions and addressing modes represent "wasted" opcode space and development effort. For example, given all of the effort that I expended on including the Rockwell instructions, I have yet to encounter an HLL construct that I can implement efficiently using those instructions. I've kept them in my ISA for compatibility purposes. Those 32 opcodes could be put to better use if I didn't want to lay down the claim that my core is still somewhat compatible with the base 6502/65C02 ISA. Finally, a simple instruction histogram that I performed on the code my Pascal compiler generated showed that roughly 64 instructions from my core's repertoire of more than 2000 unique instructions, using single or multiple prefix codes, were used in the implementation of the "Sieve of Eratosthenes" program. (Just for full disclosure. I attribute a bit of proprietariness to the PC65 compiler, but I did not develop it from scratch. Although I did correct a couple of issues, I mostly adapted its code generator to support my extended 6502 ISA soft-core processor. So I will not make any claims regarding being able to build compilers from scratch. :) )
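An instruction histogram of this kind can be sketched in a few lines of Python. This is only an illustration, not the tool actually used here: it assumes a plain assembly listing with optional "label:" prefixes and ";" comments, and the function name is made up. Assembler directives would be counted as if they were mnemonics unless filtered out.

```python
from collections import Counter


def mnemonic_histogram(listing):
    """Count how often each mnemonic appears in an assembly listing."""
    counts = Counter()
    for line in listing.splitlines():
        code = line.split(";", 1)[0].strip()  # drop trailing comment
        if not code:
            continue
        tokens = code.split()
        if tokens[0].endswith(":"):  # skip a leading label
            tokens = tokens[1:]
        if tokens:
            counts[tokens[0].upper()] += 1
    return counts
```

Comparing the set of mnemonics that show up against the full opcode table gives a quick list of instructions the code generator never touches.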
proxy wrote:
also I'm a bit confused, why have both long (24 bit) and short (16 bit) Stack Indirect Addressing modes? i thought just having the long version would be enough since it can function almost identically to the short version if you just push the Data Bank onto the stack before pushing the actual Pointer.
is it just to save stack space and make the code a bit faster/smaller?
If it was my pen and paper, I would attempt to support data structures that are greater than 64kB in size. The easiest way to do that within an HLL is to provide "linear" addressing rather than the mod 65536 manner in which the '816 provides for such data structures. It's not that it can't be done, but it is a bit difficult to manage efficiently. In my core, I did not increase the address space except through a paged memory management scheme similar to that which I used ages ago on my PDP11/24 minicomputer. (I wouldn't want to attempt the paged memory management system of RSX-11/M that I used for creating and maintaining programs greater than 64kB in length.) I support both 8-bit offsets and 16-bit offsets into the stack, and so the programmer does not have to manage the selection of the offset / base address, I do those selections in the compiler.
proxy wrote:
also what exactly do you mean with "pre-indexed"? and so far i've added long versions of all Addressing modes that have any kind of Absolute Address (16 bit) in them (Direct and Indirect). (only kinda exception being Stack Indirect, of which i have only added the 24 bit version with the 16 bit ones being limited to LDA and STA)
In the 6502/65C02 ISA, pre-indexed addressing modes, both direct and indirect, are associated with the X register. The X register is added to the base address before indirection is performed. Hence the term pre-indexed. The Y register is added after indirection, and those Y register addressing modes are known as post-indexed. In the direct addressing modes, it makes no difference when the index register, X or Y, is applied since there's not a second address cycle.
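The difference between the two indirect modes can be shown with a small Python model of the standard 6502 (zp,X) and (zp),Y address calculations. The function names and the flat bytearray standing in for memory are just for illustration.

```python
def pre_indexed_indirect(mem, zp, x):
    # (zp,X): X is added to the zero-page address BEFORE the pointer is fetched.
    ptr = (zp + x) & 0xFF
    return mem[ptr] | (mem[(ptr + 1) & 0xFF] << 8)


def post_indexed_indirect(mem, zp, y):
    # (zp),Y: the pointer is fetched first, then Y is added to it.
    base = mem[zp] | (mem[(zp + 1) & 0xFF] << 8)
    return (base + y) & 0xFFFF
```

With a pointer to $2000 stored at $10 and a pointer to $1234 stored at $14, an index of 4 selects $1234 in the pre-indexed case and $2004 in the post-indexed case.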

_________________
Michael A.


 Post subject: Re: Extended 65CE02 Core
PostPosted: Sun Jul 11, 2021 8:20 pm 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
MichaelM wrote:
I don't think that you have to be an expert in HLLs, or even in how HLLs generate machine code. You could simply examine the output from a number of compilers, which for small systems, can generally intermix the HLL constructs with the assembly / machine language output source. In this manner you can see that a large number of the 6502/65C02 ISA instructions or addressing modes are never used by the code generators. Those unused instructions and addressing modes represent "wasted" opcode space and development effort. For example, given all of the effort that I expended on including the Rockwell instructions, I have yet to encounter an HLL construct that I can implement efficiently using those instructions. I've kept them in my ISA for compatibility purposes. Those 32 opcodes could be put to better use if I didn't want to lay down the claim that my core is still somewhat compatible with the base 6502/65C02 ISA. Finally, a simple instruction histogram that I performed on the code my Pascal compiler generated showed that roughly 64 instructions from my core's repertoire of more than 2000 unique instructions, using single or multiple prefix codes, were used in the implementation of the "Sieve of Eratosthenes" program. (Just for full disclosure. I attribute a bit of proprietariness to the PC65 compiler, but I did not develop it from scratch. Although I did correct a couple of issues, I mostly adapted its code generator to support my extended 6502 ISA soft-core processor. So I will not make any claims regarding being able to build compilers from scratch. :) )

oh i see, just looking at the compiler output to see how it's doing things, if there are operations it's struggling with and how they can be improved.
main problem is that i currently have no compiler for anything, i'm already kinda dreading the day i have to port some compiler to work with this CPU, though since it's somewhat close to a 65816 i should hopefully be able to get some help from here.

MichaelM wrote:
If it was my pen and paper, I would attempt to support data structures that are greater than 64kB in size. The easiest way to do that within an HLL is to provide "linear" addressing rather than the mod 65536 manner in which the '816 provides for such data structures. It's not that it can't be done, but it is a bit difficult to manage efficiently.

that's exactly the reason why i want to omit the 16 bit Stack Indirect Addressing mode, it would limit you to the current Data Bank while the 24 bit one can point to anywhere in Memory.
and technically the CPU has a linear address space, if you set the W Flag and exclusively use the Long Addressing modes instead of the Absolute ones (which excludes some instructions like ASL/ASR/LSR/INC/DEC), then it should be pretty close to a 68k in terms of Memory linearity.
MichaelM wrote:
In my core, I did not increase the address space except through a paged memory management scheme similar to that which I used ages ago on my PDP11/24 minicomputer. (I wouldn't want to attempt the paged memory management system of RSX-11/M that I used for creating and maintaining programs greater than 64kB in length.) I support both 8-bit offsets and 16-bit offsets into the stack, and so the programmer does not have to manage the selection of the offset / base address, I do those selections in the compiler.

originally i also just wanted to use this CPU with a simple MMU to increase its memory range, but as you can see i decided to just give the CPU a larger address range.
it definitely wasn't the easier choice of the two, but i think it will pay off in the end.
also i just noticed that with the unsigned 8 bit offset into the stack that the Long Stack Indirect Addressing mode has it would only allow for a total of 85 24-bit Addresses on the Stack.
it should be fine for now, but if not i can always use the Z Register as the high byte of the offset, so the instruction stays the same size but the offset can then access the whole stack.
speaking of Stack my core also has a "Stack Y-Indexed" Addressing mode that allows you to directly access values on the Stack without having to use Stack operations (PHA/PLA), like before the offset is only 8 bits but i could also extend it using the Z Register, allowing access to the whole stack. but not right now.
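The 85-pointer limit and the Z Register extension mentioned above come down to simple arithmetic, sketched here in Python. The function names are made up, and using Z as the high byte of the offset is only the idea floated in this post, not an implemented feature.

```python
def max_long_pointers(offset_bits=8, pointer_bytes=3):
    # Number of whole 24-bit pointers that fit in the range the offset can reach.
    return (1 << offset_bits) // pointer_bytes


def extended_offset(z, imm8):
    # Proposed extension: Z supplies the high byte, the operand the low byte.
    return ((z & 0xFF) << 8) | (imm8 & 0xFF)
```

An 8-bit offset reaches 256 bytes, which is 85 three-byte pointers with one byte left over; a 16-bit offset built from Z and the operand reaches 65536 bytes.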
MichaelM wrote:
In the 6502/65C02 ISA, pre-indexed addressing modes, both direct and indirect, are associated with the X register. The X register is added to the base address before indirection is performed. Hence the term pre-indexed. The Y register is added after indirection, and those Y register addressing modes are known as post-indexed. In the direct addressing modes, it makes no difference when the index register, X or Y, is applied since there's not a second address cycle.

i see, i got a bit confused by that since i never heard that term before.


 Post subject: Re: Extended 65CE02 Core
PostPosted: Sun Jul 11, 2021 10:33 pm 

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
It's your project, and you should do as you see fit, otherwise it's a job and not a hobby. As I said before, I have enjoyed your thread, and look forward to more from you. It's interesting to read project descriptions like this one, especially for the different perspectives that others have. There's no single, right answer on an ISA.

_________________
Michael A.


 Post subject: Re: Extended 65CE02 Core
PostPosted: Thu Nov 11, 2021 6:13 pm 

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
I knew you were working on something impressive but I didn't expect even a preliminary version before Dec 2021. Indeed, I expected progress from randyhyde in Aug 2021 and long before you. Your work is simultaneously modest (one prefix) and bold (multiply, divide and modulo).

If you want ideas for implementation, I have many. This includes an addressing mode for zero glue logic Z80 bus operation, multi-core atomic operations and minimal FPU.

With FPGA, you have the luxury of inventing your own signals. Separate read and write signals would be very welcome. This would skip the typical three NAND gate qualified write logic of many 6502/65816 boards. It would also aid RC2014 or RC6502 compatibility. Indeed, if you have opcodes spare, it may be worthwhile to consider one page I/O instructions. These would work like zp,X but for an I/O segment. With or without a Z80 style /IOREQ, you may be able to simplify your signals. Primarily, this would be "Program Segment", "Data Segment" and "Other Segment" where A8 and A15 may be used to determine Direct Page, Stack, I/O Segment or Vectors. In the trivial case, these would be the first two or last two pages. At the very least, you'll make drogon happy given that the Ruby anniversary computer also uses page $FE for I/O.

aleferri solved atomic operations on 6502 in May 2021. The solution is quite simple. One or more of the read-modify-write instructions should conditionally write. With your prefix instruction, you could prefix all read-modify-write instructions and make the write phase conditional for all of them. This simplistic arrangement has the benefit that it operates in a fixed number of clock cycles. Therefore, don't try to optimize the last bus cycle of atomic operations.

S+X stack indexing requires addition in some form. Personally, I'd maintain S+X as a hidden register. If you want to implement this more frugally, don't be concerned about an additional bus cycle for ALU operation because the ALU cycles may overlap with opcode fetches. When prefix is fetched, transfer S to a hidden register. When an instruction is fetched, add or transfer X to a hidden register. In either case, X or S+X is ready for use. Unfortunately, this speculative execution might interact badly with interrupts. FPGA ALU operation may also consume more energy than a mux for S+X.

I see that Dr Jefyll has led you into the quagmire of block copy. You are clever to initiate transfer from stack references but this arrangement has nowhere to dump state when transfer is stopped prematurely. Given the vagaries of memory - especially beyond 64KB - a blitter peripheral may be preferable. Agumander's GameTank blitter copies two dimensional areas and Radical Brad's blitter copies two dimensional areas with alpha channel and run length encoding. I recommend a one element FIFO and interrupt arrangement. This allows solid blitting with no loss of cycles. Power of two sprite scaling would be highly appreciated because, quite honestly, I'm tired of 6502 games below the standard of Out Run. Sprite scaling is too much to ask in discrete implementation but if you want block copy which matches or exceeds 65816 MVP/MVN, two dimensional copy covers more cases.

May I be the first to suggest the additional quagmire of floating point? I believe that a very minimal FPU can be implemented with eight instructions or fewer. All instructions would be abs,X only and the implementation would not conform to IEEE754 or provide conversion functions. Instructions would load, store, add, subtract, compare and multiply. This would apply to a separate accumulator and common flag register. If you choose a suitable representation then DEC abs,X provides floating point right shift and atomic DEC abs,X provides right shift without underflow. Aha! Overall, this arrangement is more suited than x87 to the Newton-Raphson approximation of inverse square root.
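As a sketch of why so few operations are enough, here is the Newton-Raphson iteration for inverse square root written in Python using only multiply, subtract and halving. The halving is done with math.ldexp to stand in for the exponent decrement described above; the function name, the initial guess derived from the exponent and the iteration count are assumptions for illustration, and Python floats stand in for whatever representation the FPU would use.

```python
import math


def inv_sqrt(x, iterations=8):
    """Approximate 1/sqrt(x) for x > 0 with y <- y * (1.5 - 0.5 * x * y * y)."""
    _, exponent = math.frexp(x)              # x = m * 2**exponent, 0.5 <= m < 1
    y = math.ldexp(1.0, -(exponent // 2))    # rough guess from the exponent alone
    for _ in range(iterations):
        half_xyy = math.ldexp(x * y * y, -1)  # halving = exponent decrement
        y = y * (1.5 - half_xyy)
    return y
```

Each pass roughly squares the relative error, so a handful of iterations is enough once the initial guess is within a factor of about 1.4 of the true value.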

I asked a sheep in the particle physics field if CERN used hyperbolic tangent in relativistic calculations. The answer was "Not knowingly." Therefore, you can dump all of the stupid grandstanding floating point functions. Instead, do sine and cosine concurrently with successive approximations of rotation and implement the remainder with Maclaurin, Taylor and Newton-Raphson approximations.
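One well-known scheme that produces sine and cosine together by successive rotations is CORDIC; whether that is exactly what is meant above is an assumption. The Python sketch below uses floats and a table of arctangents for clarity (a hardware version would use fixed point and shifts), and the names and iteration count are chosen only for illustration.

```python
import math

ITERATIONS = 24
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

# Each micro-rotation stretches the vector slightly; pre-scale to compensate.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(theta):
    """Return (sin, cos) of theta in radians, valid for about |theta| <= 1.74."""
    x, y, z = GAIN, 0.0, theta
    for i, angle in enumerate(ANGLES):
        d = 1.0 if z >= 0 else -1.0  # rotate toward the remaining angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angle
    return y, x
```

Both results fall out of the same loop, and each step needs only a shift, an add or subtract and a table lookup.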

If you work on an FPU, you should definitely read the FPU chapter of the John Wiley & Sons book Advanced FPGA Design by Steve Kilts. You should also read the 1987 internal INMOS documents regarding the T800 Transputer FPU. The INMOS documents have been published by David Mays. The T800 FPU document explains how to verify an FPU and was written a mere eight years before Intel Pentium Errata #2: FDIV Bug.

If you are experimenting with stack introspection, you may want to add a constant to RegS or implement PHS/PLS. Given the association of RegS and RegX, PHS/PLS is most logically implemented by escaping PHX/PLX. Unfortunately, pushing the current stack pointer onto the stack is useless. However, a computed value aids functions with a variable number of parameters. You may also want to experiment with separate program and data stacks. This may aid implementation of Forth. It is also of general interest. This could be activated with an escaped flag; possibly operating on an escaped RegQ flag register. For downward compatibility, it could instead be implemented as an opt-in, per use prefix. Unfortunately, an alt stack may be incompatible with A8/A15 segment signals. Separately, you may want to investigate a privilege system where escaped CLV drops privileges and has no counterpart to increase privileges. Likewise, PHQ/PLQ would be privileged.

Finally, I'm a keen proponent of abs,Z. You may wish to substitute all abs addressing with abs,Z. If RegZ is reset and never modified then STZ and abs,Z behave as expected. However, applications which populate RegZ have abs,X, abs,Y and abs,Z addressing modes and must zero one register to obtain abs. This arrangement is never worse than 65C02 and it is a significant boon as address-space and register width increases. Specifically, the semantic switches from abs+R to R+offset. Most immediately, for your extension, it allows (abs,X),Y or zp,X and sp,Y to be mixed with abs,Z.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!

