Dr Jefyll wrote:
It's slower than DMA, but one approach would be to create an instruction that moves just one byte, and can automatically execute repeatedly (kind of like what the '816 does with MVP and MVN).
Wait, slower than DMA?! Well, I can't allow something like that in an FPGA core that's all about low cycle times and speed, now can I?
Dr Jefyll wrote:
In the following example, the source and destination addresses are held in a combination of registers and zero-page. The usage is unusual. X and Y hold the least significant bytes of the addresses -- the address "low" or ADL bytes. The two ADH ("high") bytes are held in two consecutive bytes in z-pg. Each iteration would go like this:
- Cycle 1: read the instruction Prefix.
- Cycle 2: read the instruction Opcode
- Cycle 3: read the instruction operand. It points to the z-pg location of the two bytes. Let's say the operand is $42.
- Cycle 4: read location $42. This gives us the ADH of the Source address.
- Cycle 5: read a byte to be moved. The address is formed by ($42) concatenated with X. Increment X.
- Optional Cycle 5a: if X overflowed (i.e. is now =0), increment the ADH value and write it back to $42.
- Cycle 6: read location $43. This gives us the ADH of the Destination address.
- Cycle 7: write the byte to be moved. The address is formed by ($43) concatenated with Y. Increment Y. Decrement Z (which holds the count). If Z <>0, don't advance PC. Instead, roll it back so it points to the Prefix again.
- Optional Cycle 7a: if Y overflowed (i.e. is now =0), increment the ADH value and write it back to $43.
I see what you mean: each operation is its own instruction, and it just stays on its own opcode until Z == 0.
But like you said later on, this also introduces inefficiencies, as the prefix, opcode and operand have to be read again for every single byte moved.
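To put a number on that overhead, here is a rough Python sketch that just counts cycles for the quoted scheme. The function name and the 1..255 limit on the count are my own assumptions for illustration; only the 7 base cycles and the optional 5a/7a cycles come from the list above.

```python
def repeat_move_cycles(count, x=0, y=0):
    """Count cycles for the quoted repeating one-byte move (illustrative model)."""
    if not 1 <= count <= 255:
        raise ValueError("Z is modelled as an 8-bit counter holding 1..255")
    cycles = 0
    z = count
    while True:
        cycles += 7              # cycles 1-7: prefix, opcode, operand, ADH, read, ADH, write
        x = (x + 1) & 0xFF
        if x == 0:
            cycles += 1          # optional cycle 5a: write back the source ADH
        y = (y + 1) & 0xFF
        if y == 0:
            cycles += 1          # optional cycle 7a: write back the destination ADH
        z -= 1
        if z == 0:
            return cycles        # Z reached 0, PC finally advances


short_move = repeat_move_cycles(4)               # 4 bytes, no page crossing
crossing_move = repeat_move_cycles(2, x=0xFF)    # X wraps on the first byte
```

So every byte costs at least 7 cycles, and 3 of those are spent re-reading the prefix, opcode and operand.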
My idea was to have a separate DMAC inside the CPU that could be activated on command and would pause itself (or completely stop) if an interrupt or abort is caught.
The plus side is that it requires fewer new microinstructions and is much faster; the downside is that it adds a lot of new registers and logic.
Something like this:
Code:
CPU:                                           | State Machine (SM):
0. Load Prefix; End of Instruction             | 0. Nothing
0. Load Opcode                                 | 0. Nothing
1. Load Operand byte into the TBL Register     | 1. SM detects that its Opcode is loaded; Was the SM Interrupted Last time (BUSY == 1)? if yes Jump to Cycle 8
2. NOP                                         | 2. if not, Load Operand "BytesLow" from the Stack at Address SP+TBL+0
3. Nothing                                     | 3. Load Operand "BytesHigh" from the Stack at Address SP+TBL+1
4. Nothing                                     | 4. Load Operand "SourceLow" from the Stack at Address SP+TBL+2
5. Stay on Cycle 5 until Interrupt is received | 5. Load Operand "SourceHigh" from the Stack at Address SP+TBL+3
5. Stay on Cycle 5 until Interrupt is received | 6. Load Operand "DestinationLow" from the Stack at Address SP+TBL+4
5. Stay on Cycle 5 until Interrupt is received | 7. Load Operand "DestinationHigh" from the Stack at Address SP+TBL+5; Set BUSY
5. Stay on Cycle 5 until Interrupt is received | 8. Load into TEMP Register from Memory at Address SOURCE + BYTES
5. Stay on Cycle 5 until Interrupt is received | 9. Store TEMP Register into Memory at Address DEST + BYTES; if BYTES <> 0 Jump to Cycle 8 and Decrement BYTES (Clear BUSY if BYTES == 0)
                                               |    Repeat until BYTES == 0
5. Stay on Cycle 5 until Interrupt is received | 10. End the SM, send a fake Interrupt to the Control Unit
6. End of Instruction; Load next Opcode        | 11. Nothing
(Note that this might change if I go for a 24-bit address bus, which would require 2 extra cycles before the DMAC starts its loop, for the 2 extra address bytes.)
As you can see, the actual moving is a simple 2-cycle loop: 1 cycle to load a byte, and 1 to write it back and advance to the next address. (That works out to 500,000 bytes/s per MHz, or roughly 0.48 MiB/s per MHz.)
There is also the "BUSY" flip-flop, which gets set at the SM's Cycle 7 (meaning it will be 1 at the start of Cycle 8) and gets cleared at Cycle 9 (only when the loop ends).
The purpose of the BUSY flip-flop is to keep track of whether the SM has already started its loop.
So when an interrupt occurs before BUSY is set (i.e. while it's loading operands), the next time the instruction starts it will start from the very beginning.
But if an interrupt occurs after BUSY is set, BUSY will remain set, so the next time the instruction starts it will skip the operand fetching and go straight back to the loop.
The only exception (as usual) is Abort: when an Abort occurs, the SM is forced to clear the BUSY flip-flop.
I think this should cover every possible interrupt case.
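To sanity-check the BUSY logic, here is a small behavioural model in Python. It is only a sketch under a few assumptions of mine: the class and method names are invented, the operand table address is passed in directly instead of being computed from SP+TBL, addresses are 16-bit, an interrupt is modelled as a cycle budget running out, and BYTES is read literally from the table above as a descending index (so a value of N copies N+1 bytes).

```python
class Dmac:
    """Toy model of the move state machine (SM); one _tick() is one cycle."""

    def __init__(self, mem):
        self.mem = mem
        self.busy = False                  # the BUSY flip-flop
        self.bytes = self.source = self.dest = 0
        self._left = None                  # cycles left before the interrupt

    def _tick(self):
        """Spend one cycle; return False once the interrupt has arrived."""
        if self._left is None:
            return True
        if self._left == 0:
            return False
        self._left -= 1
        return True

    def run(self, table, cycles_until_irq=None):
        """Start or resume the instruction. True = finished, False = interrupted."""
        self._left = cycles_until_irq
        if not self.busy:
            raw = []
            for offset in range(6):        # SM cycles 2-7: operand fetch
                if not self._tick():
                    return False           # BUSY still clear: full restart next time
                raw.append(self.mem[table + offset])
            self.bytes = raw[0] | raw[1] << 8
            self.source = raw[2] | raw[3] << 8
            self.dest = raw[4] | raw[5] << 8
            self.busy = True               # set once cycle 7 has completed
        while True:
            if not self._tick():
                return False               # BUSY stays set: resume inside the loop
            temp = self.mem[(self.source + self.bytes) & 0xFFFF]    # cycle 8
            if not self._tick():
                return False
            self.mem[(self.dest + self.bytes) & 0xFFFF] = temp      # cycle 9
            if self.bytes == 0:
                self.busy = False          # loop done, clear BUSY
                return True
            self.bytes -= 1

    def abort(self):
        """Abort forces BUSY back to 0, so the next start refetches operands."""
        self.busy = False


mem = bytearray(0x10000)
mem[0x2000:0x2004] = b"ABCD"
mem[0x0180:0x0186] = bytes([0x03, 0x00, 0x00, 0x20, 0x00, 0x30])
dmac = Dmac(mem)
first = dmac.run(0x0180, cycles_until_irq=9)   # interrupted in the middle of the loop
second = dmac.run(0x0180)                      # resumes without refetching operands
```

Interrupting between cycle 8 and cycle 9 is harmless in this model because the byte is simply read again on resume.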
Dr Jefyll wrote:
Also, I'm dithering over whether it might be better if the instruction Operand were to indicate an offset into the stack (rather than a z-pg address). The stack would probably be more convenient to code for. But adding the Operand to S will probably cost a cycle, and that kinda hurts performance unless that extra cycle is excluded from the restart loop. No doubt it can be excluded, but I expect there'd be a cost in resources. Always tradeoffs...
Hmm, the stack seems to be a convenient place to load operands from, though I cannot modify the value of SP while reading operands: if an interrupt occurs right when the SM loads its last few operands, the interrupt sequence would overwrite the previously loaded bytes on the stack.
The order of bytes on the stack could be (top = lower address, bottom = higher address):
- Amount to Move Low byte
- Amount to Move High byte
- Destination Address Low byte
- Destination Address High byte
- Source Address Low byte
- Source Address High byte
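A quick sketch of how the SM could pick those six bytes up without touching SP, using the byte order from the list just above. The names here are placeholders, and I'm assuming a 6502-style stack page at $0100 and that SP+TBL+5 does not run past the end of that page.

```python
STACK_BASE = 0x0100   # assumption: 6502-style stack page


def read_move_operands(mem, sp, tbl):
    """Read the six operand bytes at SP+TBL+0..5; SP itself is never changed."""
    base = STACK_BASE + sp + tbl

    def word(offset):
        # little-endian 16-bit value: low byte first, then high byte
        return mem[base + offset] | (mem[base + offset + 1] << 8)

    return {"amount": word(0), "dest": word(2), "source": word(4)}


mem = bytearray(0x10000)
sp, tbl = 0xF0, 0x01
mem[0x01F1:0x01F7] = bytes([0x34, 0x12, 0x00, 0x30, 0x00, 0x20])
operands = read_move_operands(mem, sp, tbl)
```

Since SP stays where it is, an interrupt during the operand fetch pushes below the table and leaves the six bytes intact.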
Dr Jefyll wrote:
There are many, many ways this could work. You might want to have a look at my 1988 KK Computer, which uses a scheme that's fairly simple and general. There are four bank registers, called K0-K3. K1 to K3 can be invoked by the programmer at will. In other words, they're not tied to any specific function, such as Vector Bank Register. And K0 is the default -- it's what you get when you don't ask to invoke K1-K3.
I opted to keep zero-page and the stack in Bank 0 at all times, which means K0-K3 are ignored for these accesses. I'm not bothered by keeping stack and z-pg in Bank 0. The important thing (for me) is being able to address data arrays that exceed 64K. (The same 24-bit space can also contain code, but for me the need for code space is secondary compared to the need for data space.)
Hmm, interesting idea to have multiple, almost orthogonal bank registers, but unlike the 65C02 I don't have any spare/blank opcodes that I could use as additional prefix bytes. While I do have 204 unused opcodes in my extended opcode table, I don't like stacking prefix bytes, as the instructions would just get ridiculously long, and I don't think having multiple copies of the same instruction that each use a different DBR will be worth the extra opcodes it would take up. So I don't think that idea is right for this project.
But I could at least add a few load instructions specifically for the DBR (and the other bank register), so switching between different banks is less of a hassle.
For the actual addressing modes, I was thinking of these:
- Absolute Long
- Absolute Long X-Indexed
- Absolute Long Y-Indexed
- Base Page Indirect Long Z-Indexed
And the instructions that make use of those would be: LDA, STA, ADC, SBC, AND, ORA, XOR, CMP, INC, DEC, ICC, DCC, ASL, ASR, LSR, ROL, ROR
Note that INC, DEC, ICC, DCC, ASL, ASR, LSR, ROL, and ROR only use the first 3 addressing modes (which is still more than the 65816 allows).
In total this would add up to 43 new instructions, plus any miscellaneous instructions like long (and maybe indirect long) versions of JMP/JSR/RTS, push/pull, special loads/stores, etc.
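Roughly how I picture the effective addresses for those four modes, as a Python sketch. The details are assumptions, not decisions: 24-bit operands, index registers that carry into the bank byte and wrap at 24 bits, little-endian 3-byte pointers in the base page, and all the function names.

```python
MASK24 = 0xFFFFFF   # assumption: 24-bit address space, wraps at the top


def abs_long(operand):
    """Absolute Long: the 3-byte operand is the address."""
    return operand & MASK24


def abs_long_x(operand, x):
    """Absolute Long X-Indexed: the index carries into the bank byte."""
    return (operand + x) & MASK24


def abs_long_y(operand, y):
    """Absolute Long Y-Indexed."""
    return (operand + y) & MASK24


def bp_indirect_long_z(mem, bp_offset, z, base_page=0x00):
    """Base Page Indirect Long Z-Indexed: 3-byte pointer in the base page, plus Z."""
    at = (base_page << 8) | bp_offset
    pointer = mem[at] | mem[at + 1] << 8 | mem[at + 2] << 16
    return (pointer + z) & MASK24


mem = bytearray(0x10000)
mem[0x10:0x13] = bytes([0x00, 0x80, 0x02])     # pointer to $028000
crossing = abs_long_x(0x01FFFF, 1)             # crosses into bank $02
indirect = bp_indirect_long_z(mem, 0x10, 5)
```

Whether indexing should really carry into the bank byte (as here) or wrap inside the bank is one of the things that still needs to be settled.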
Overall I'm really warming up to the idea of extra address space. With the way I have it in mind, only the new instructions I add would actually interact with or modify any of the new registers; all of the existing instructions would just work like normal without requiring any rewrite/rewiring. I'll make a copy of the Logisim version and see if I can modify the circuit to allow for bank registers and all that.
Though I'm still not sure if an Emulation mode is necessary: the only difference from Native mode would be how many bytes interrupts and BRK push onto the stack, and how many RTI pulls from the stack.
So I'm kinda thinking about saying "screw it", not implementing an Emulation mode at all, and just hoping there aren't many programs that use interrupts, BRK, and RTI in some weird way that only works with 16-bit addresses.
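For comparison, this is the kind of difference I mean, sketched in Python after the way the 65816 does it (PCH, PCL, P in Emulation mode, with the program bank pushed first in Native mode). The function name and the fixed stack page at $0100 are assumptions, and my core may end up pushing things differently.

```python
def push_interrupt_frame(mem, sp, pc, status, bank=None):
    """Push an interrupt frame and return the new SP (65816-style ordering)."""

    def push(value):
        nonlocal sp
        mem[0x0100 + sp] = value & 0xFF
        sp = (sp - 1) & 0xFF

    if bank is not None:
        push(bank)        # Native mode only: program bank goes first
    push(pc >> 8)         # PCH
    push(pc)              # PCL
    push(status)          # P
    return sp


mem = bytearray(0x0200)
sp_emulation = push_interrupt_frame(mem, 0xFF, 0x1234, 0x30)             # 3 bytes
sp_native = push_interrupt_frame(mem, 0xFF, 0x1234, 0x30, bank=0x02)     # 4 bytes
```

Code that pokes around in the stack frame inside an interrupt handler (for example to read the return address after BRK) is the sort of thing that would notice the extra byte.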