
All times are UTC




Post subject: Extended 65CE02 Core
Posted: Sun Jun 13, 2021 1:05 pm

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
Hi Everyone!

Finally, after 4 Billion Years i have returned with an (at least somewhat) functional 65CE02 Softcore.
The reason i'm only posting about this now is that i wanted to have the Project at a point where i can throw the CPU on a real FPGA and have it actually run code.
Using my Altera DE2 Dev Board (Cyclone II) i'm able to run this 2700 Logic Element beast of a CPU at 25MHz using a simple Incrementing Program (both ROM and RAM are internal for testing purposes).

from what i know this is basically the second 65CE02 based Softcore ever made, right after the GS4510 used in the Mega65.
I personally don't like the way the GS4510 adds new instructions. IIRC it does it by having sequences of existing instructions activate some extended function, kinda like a cheat code in a game.
in this core I went with the Z80 way of adding Instructions: a single opcode (AUG/MAP) acts as a prefix byte to access a separate opcode table with 256 potential instructions (only 52 are actually used right now).
this means that as long as a program stays away from the AUG Opcode there is no chance to accidentally execute an extended Instruction.
Anyways, time to talk about the actual CPU itself.

I want to start with Pins.

while i was procrastinating... i mean working on this core i've looked a bit into the 65816 as well, and i really liked its VPA, VDA, and Abort pins.
so i've added them to this core with some additional V?A pins!

Code:
Name      Full Name                 Direction

VPA     = Valid Program Address     Output
VDA     = Valid Data Address        Output
VSA     = Valid Stack Address       Output
VBA     = Valid Base Page Address   Output
VVA     = Valid Vector Address      Output
ABT     = Abort Instruction         Input

Function:

VVA VBA VSA VDA VPA
0   0   0   0   0   Internal Operation
0   0   0   0   1   Opcode Fetch
0   0   0   1   1   Operand Fetch
0   0   0   1   0   Normal Data Read/Write
0   1   0   1   0   Base Page Read/Write
0   0   1   1   0   Stack Read/Write
1   0   1   1   0   RESET/ABT/NMI/IRQ/BRK Stack Write or RTI Stack Read
1   0   0   1   0   Vector Read

I have also added the MLB pin from the 65C02, though VBP was not added as it's basically just an inverted version of VVA in this case.

I'm hoping that this will allow for some rather complex Memory Systems. for example you could give the Stack its own 64k of RAM and have it completely separate from regular Memory. not only would that give you more stack space than you would ever need, it would also prevent the stack from accidentally overwriting your regular non-Stack Memory.
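
As a rough illustration of how external logic could use these pins, here is a minimal Python sketch of a decoder built from the function table above. It is only a model, not part of the core: the names `CYCLE_TYPES`, `decode_cycle`, and `select_memory`, and the bank names `stack_ram` and `main_ram`, are made up for this example.

```python
# Hypothetical model of the V?A status pins, following the function table.
# Keys are (VVA, VBA, VSA, VDA, VPA); the string labels are illustrative.
CYCLE_TYPES = {
    (0, 0, 0, 0, 0): "internal",
    (0, 0, 0, 0, 1): "opcode_fetch",
    (0, 0, 0, 1, 1): "operand_fetch",
    (0, 0, 0, 1, 0): "data",
    (0, 1, 0, 1, 0): "base_page",
    (0, 0, 1, 1, 0): "stack",
    (1, 0, 1, 1, 0): "interrupt_stack",
    (1, 0, 0, 1, 0): "vector",
}


def decode_cycle(vva, vba, vsa, vda, vpa):
    """Return the bus cycle type for one combination of pin states."""
    return CYCLE_TYPES.get((vva, vba, vsa, vda, vpa), "undefined")


def select_memory(vva, vba, vsa, vda, vpa):
    """Pick a RAM bank: every cycle with VSA high goes to a separate stack RAM."""
    kind = decode_cycle(vva, vba, vsa, vda, vpa)
    if kind in ("internal", "undefined"):
        return None  # no valid address on the bus
    return "stack_ram" if vsa else "main_ram"
```

In real hardware this would be a handful of gates driving chip selects; the dictionary just makes the table easy to check.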

and of course there is also the ABT Pin, with one major difference to its 65816 cousin: it doesn't wait for the current instruction to finish. after an Abort is received the CPU will start the Interrupt Sequence after the next falling edge of PHI2. obviously pulling ABT low right before the falling edge of PHI2 will likely cause problems, so i'd recommend pulling it low on the rising edge of PHI2 and then pulling it high again at the next rising edge.
the Abort vector is located at 0xFFF8 and 0xFFF9, like on the 65816 in Emulation Mode.


Now let's talk about Instructions!

all of the 65CE02's Instructions are implemented, and most of the cycle times are the same as the original (since it already removes most dummy cycles) but i was still able to speed up some instructions, the following list shows all instructions i was able to speed up by a single cycle, original cycle count is next to the Opcode in brackets:

PHA (3), PHX (3), PHY (3), PHZ (3), PHP (3), PLA (3), PLX (3), PLY (3), PLZ (3), PLP (3), CLE (2), SEE (2), SEI (2), ASR A (2), NEG A (2), BBR (4), BBS (4), BRK (7), RTS (4), RTI (5), LDA (d.SP),Y (6), STA (d.SP),Y (6)

RTN # was sped up from 7 to 4 cycles, which kinda scares me because it makes me think i implemented the instruction incorrectly if i was able to shave off so many cycles.
but from what i can tell it's literally just an RTS Instruction with an Immediate value used as an Unsigned offset into the Stack...


Now to the new Instructions!
here is the whole Opcode table:
Attachment: soffice.bin_2021-06-13_12-09-59.png [ 714.76 KiB ]

I tried to place the Instructions in such a way that they match existing instructions. so for example taking ORA and slapping AUG in front of it turns it into MUL (assuming the addressing mode used also exists for the extended Instruction).
and here is the full List of New Instructions (note that both Bytes and Cycles have a +1 to them because of the Prefix Opcode being executed before the actual extended Instruction):
Code:
MUL/MLL   - "Multiply"/"Multiply Low" Multiplies A by a value from Memory, result (low byte) stored in A. (the Carry being 1 indicates that the High byte (aka MLH result) is not equal to 0)

   N V S B D I Z C
   + - - - - - + +
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 09    Immediate         MUL #nn         2+1     2+1
   5C 05    Base Page         MUL nn          2+1     3+1
   5C 15    Base Page X       MUL nn,X        2+1     3+1
   5C 0D    Absolute          MUL nnnn        3+1     4+1
   5C 1D    Absolute X        MUL nnnn,X      3+1     4+1
   5C 19    Absolute Y        MUL nnnn,Y      3+1     4+1


MLH      - "Multiply High" Multiplies A by a value from Memory, result (high byte) stored in A (the Carry being 1 indicates that the Low byte (aka MLL result) is not equal to 0)

   N V S B D I Z C
   + - - - - - + +
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 29    Immediate         MLH #nn         2+1     2+1
   5C 25    Base Page         MLH nn          2+1     3+1
   5C 35    Base Page X       MLH nn,X        2+1     3+1
   5C 2D    Absolute          MLH nnnn        3+1     4+1
   5C 3D    Absolute X        MLH nnnn,X      3+1     4+1
   5C 39    Absolute Y        MLH nnnn,Y      3+1     4+1


MOD      - Divides a value from Memory by A, remainder stored in A. (the Carry being 1 indicates that the High byte (aka DIV result) is not equal to 0)

   N V S B D I Z C
   + - - - - - + +
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 49    Immediate         MOD #nn         2+1     2+1+?
   5C 45    Base Page         MOD nn          2+1     3+1+?
   5C 55    Base Page X       MOD nn,X        2+1     3+1+?
   5C 4D    Absolute          MOD nnnn        3+1     4+1+?
   5C 5D    Absolute X        MOD nnnn,X      3+1     4+1+?
   5C 59    Absolute Y        MOD nnnn,Y      3+1     4+1+?


DIV      - Divides a value from Memory by A, result stored in A (the Carry being 1 indicates that the Low byte (aka MOD result) is not equal to 0)

   N V S B D I Z C
   + - - - - - + +
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 69    Immediate         DIV #nn         2+1     2+1+?
   5C 65    Base Page         DIV nn          2+1     3+1+?
   5C 75    Base Page X       DIV nn,X        2+1     3+1+?
   5C 6D    Absolute          DIV nnnn        3+1     4+1+?
   5C 7D    Absolute X        DIV nnnn,X      3+1     4+1+?
   5C 79    Absolute Y        DIV nnnn,Y      3+1     4+1+?


LWR      - "Logic shift Word Right", Shifts all bits 1 to the right, 0 into bit 15, and bit 0 into Carry

   N V S B D I Z C
   + - - - - - + +
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C CA    Absolute          LRW nnnn        3+1     7+1


AWR      - "Arithmetic shift Word Right", Shifts all bits 1 to the right, bit 15 into bit 14, and bit 0 into Carry

   N V S B D I Z C
   + - - - - - + +
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C CB    Absolute          ARW nnnn        3+1     7+1


RWR      - "Rotate Word Right", Shifts all bits 1 to the right, Carry into bit 15, and bit 0 into Carry

   N V S B D I Z C
   + - - - - - + +

   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C EB    Absolute          RRW nnnn        3+1     7+1


ICC      - "Increment with Carry", Adds 0 + C to a Value in Memory

   N V S B D I Z C
   + - - - - - + +
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C E6    Base Page         ICC nn          2+1     4+1
   5C F6    Base Page X       ICC nn,X        2+1     4+1
   5C EE    Absolute          ICC nnnn        3+1     5+1
   5C FE    Absolute X        ICC nnnn,X      3+1     5+1


DCC      - "Decrement with Carry", Subtracts 1 + C from a Value in Memory

   N V S B D I Z C
   + - - - - - + +
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C C6    Base Page         DCC nn          2+1     4+1
   5C D6    Base Page X       DCC nn,X        2+1     4+1
   5C CE    Absolute          DCC nnnn        3+1     5+1
   5C DE    Absolute X        DCC nnnn,X      3+1     5+1


SWP      - "Swap Nibble", Swaps the High and Low Nibble of the A Register.

   N V S B D I Z C
   - - - - - - - -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 4B    Implied           SWN             1+1     1+1


SXY      - "Swap X and Y", Swaps the contents of the X and Y Register.

   N V S B D I Z C
   - - - - - - - -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 5B    Implied           SXY             1+1     2+1


SXZ      - "Swap X and Z", Swaps the contents of the X and Z Register.

   N V S B D I Z C
   - - - - - - - -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 6B    Implied           SXZ             1+1     2+1


SYZ      - "Swap Y and Z", Swaps the contents of the Y and Z Register.

   N V S B D I Z C
   - - - - - - - -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 7B    Implied           SYZ             1+1     2+1


LDA      - Load Memory into Accumulator

   N V S B D I Z C
   + - - - - - + -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C E2    Stack Y           LDA SP,Y        1+1     1+2
   5C E1    Double Indirect   LDA (nn,X),Y    1+2     1+5
   5C F1    PC Relative       LDA PC,XY       1+1     1+2


STA      - Store Accumulator into Memory

   N V S B D I Z C
   - - - - - - - -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 82    Stack Y           STA SP,Y        1+1     1+2
   5C 81    Double Indirect   STA (nn,X),Y    1+2     1+5
   5C 91    PC Relative       STA PC,XY       1+1     1+2


CHB      - "Convert Hex to BCD", Converts a Binary Value 0x00 - 0x63 to a BCD Value 0x00 - 0x99

   N V S B D I Z C
   + - - - - - + +
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C E8    Implied           CHB             1+1     1+1


CBH      - "Convert BCD to Hex", Reverse of CHB

   N V S B D I Z C
   + - - - - - + +

   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C C8    Implied           CBH             1+1     1+1


RND      - "Random", Loads a Byte from the LFSR into the Accumulator

   N V S B D I Z C
   + - - - - - + -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 3B    Implied           RND             1+1     1+1


PHR      - "Push Registers", Pushes the Z, Y, X, B, A Registers onto the Stack in that Order

   N V S B D I Z C
   - - - - - - - -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 48    Implied           PHR             1+1     6+1


PLR      - "Pull Registers", Pulls the A, B, X, Y, Z Registers from the Stack in that Order

   N V S B D I Z C
   - - - - - - - -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C 68    Implied           PLR             1+1     6+1


WAI      - "Wait", Halts the CPU until an Interrupt (IRQ (if enabled), NMI, ABT, RESET) occurs

   N V S B D I Z C
   - - - - - - - -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C EF    Implied           WAI             1+1     ∞


STP      - "Stop", Halts the CPU until an Interrupt (ABT, RESET) occurs

   N V S B D I Z C
   - - - - - - - -
   
   Opcode   Addressing        Mnemonic        Bytes   Cycles
   ---------------------------------------------------------
   5C FF    Implied           STP             1+1     ∞


now a bit more detail about the Instructions.

MUL, MLH, MOD, and DIV are pretty straightforward, their placement in the Opcode table matches ORA, AND, EOR, and ADC in that order.
though the actual Division circuitry is not really ready yet (which is why the Cycles have a +? next to them). i'll likely end up making a separate state machine that does a simple subtraction loop.
it won't be fast but still faster and more compact than doing it in software.
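
To make the result and Carry rules of the four instructions concrete, here is a small Python model. It is only a sketch of the behaviour described in the list above, assuming unsigned 8-bit operands; the function names are made up for this example, and the `ValueError` on a zero divisor is a choice of the model, not something the core is stated to do.

```python
def mul(a, m):
    """MUL/MLL: low byte of A * M; Carry set if the high byte is non-zero."""
    product = (a & 0xFF) * (m & 0xFF)
    return product & 0xFF, int((product >> 8) != 0)


def mlh(a, m):
    """MLH: high byte of A * M; Carry set if the low byte is non-zero."""
    product = (a & 0xFF) * (m & 0xFF)
    return product >> 8, int((product & 0xFF) != 0)


def div(a, m):
    """DIV: quotient of M / A; Carry set if the remainder is non-zero."""
    if a == 0:
        raise ValueError("divide by zero is not defined in this model")
    return m // a, int((m % a) != 0)


def mod(a, m):
    """MOD: remainder of M / A; Carry set if the quotient is non-zero."""
    if a == 0:
        raise ValueError("divide by zero is not defined in this model")
    return m % a, int((m // a) != 0)
```

Each function returns `(result_in_A, carry)`, so a 16-bit product can be assembled from one MUL and one MLH on the same operands.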

LWR, AWR, and RRW, not much to say either, they are just counterparts to the existing ASW/AWL, and ROW/RLW, and are therefore also placed in the same opcodes (exception being LWR as it has no counterpart)
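
The three word shifts can be modelled in a few lines of Python. This is a sketch of the descriptions in the instruction list, operating on a 16-bit value instead of two bytes in memory; the function names simply reuse the headings from the list.

```python
def lwr(word):
    """Logic shift right: 0 into bit 15, bit 0 into Carry."""
    return (word >> 1) & 0x7FFF, word & 1


def awr(word):
    """Arithmetic shift right: bit 15 is kept (sign), bit 0 into Carry."""
    return ((word >> 1) | (word & 0x8000)) & 0xFFFF, word & 1


def rwr(word, carry_in):
    """Rotate right: old Carry into bit 15, bit 0 into Carry."""
    return ((word >> 1) | (carry_in << 15)) & 0xFFFF, word & 1
```

Each function returns `(new_word, carry_out)`.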

ICC, and DCC, Interesting Instructions that allow you to Increment/Decrement multi-byte values in Memory without using Branches or long LDA/ADC/STA sequences. note that like ADC/SBC they are affected by the D Flag. their placement in the Opcode table matches INC and DEC.
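
As an illustration of the branch-free multi-byte increment, here is a Python sketch of ICC. It assumes binary mode (D Flag clear) and that ICC sets Carry when the byte wraps, which the flag row in the list suggests but does not spell out; `icc` and `increment_multibyte` are names invented for this example.

```python
def icc(value, carry):
    """Model of ICC in binary mode: memory byte + 0 + Carry."""
    total = value + carry
    return total & 0xFF, int(total > 0xFF)


def increment_multibyte(data):
    """Increment a little-endian multi-byte value with one ICC per byte."""
    carry = 1  # corresponds to a SEC before the first ICC
    out = []
    for byte in data:
        byte, carry = icc(byte, carry)
        out.append(byte)
    return out
```

On the CPU this would be a SEC followed by one ICC per byte, lowest byte first, with no branches in between.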

SWP, SXY, SXZ, and SYZ, again pretty straightforward. their placement in the Opcode table matches TAZ, TAB, TZA, and TBA in that order.
also note that none of them update any flags.

LDA/STA SP,Y, LDA/STA (bp,X),Y, and LDA/STA PC,XY, some crazy looking Addressing Modes. the LDA Instructions don't directly line up with existing LDA Instructions.
SP,Y is the simplest one. it Accesses Memory at the location of the current Stack Pointer with Y added as an unsigned offset. note that the Stack Pointer is not modified during this and that this counts as a Stack Read/Write Operation, so VSA will be pulled high.
(bp,X),Y or "Double Indirect" is just both of the regular X/Y Indirect Addressing Modes mashed together. i've heard on here this is useful to do stuff like accessing an array of data from a table of pointers, all in a single instruction.
PC,XY is the weirdest Mode. it Accesses Memory at the location of the Program Counter (ie opcode of the next instruction) with the 16 bit XY pair being added as an Offset (X = High Byte, Y = Low Byte).
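
The effective address of each mode can be written down as a short Python sketch. Several details here are assumptions and not taken from the core: a full 16-bit Stack Pointer value, a little-endian pointer fetched from the base page selected by the B register, wrap-around inside that page, and 16-bit wrap of the final address. The function names and the `mem` dictionary are hypothetical.

```python
def ea_stack_y(sp, y):
    """SP,Y: Stack Pointer plus Y as an unsigned offset (SP is not changed)."""
    return (sp + y) & 0xFFFF


def ea_double_indirect(mem, b, nn, x, y):
    """(nn,X),Y: fetch a pointer from base page at nn+X, then add Y."""
    lo = mem[(b << 8) | ((nn + x) & 0xFF)]
    hi = mem[(b << 8) | ((nn + x + 1) & 0xFF)]
    return (((hi << 8) | lo) + y) & 0xFFFF


def ea_pc_xy(next_pc, x, y):
    """PC,XY: address of the next opcode plus the 16-bit offset X:Y."""
    return (next_pc + ((x << 8) | y)) & 0xFFFF
```

With a table of pointers at `nn`, X selects the pointer and Y selects the element inside the array it points to.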

CHB, and CBH, allow you to convert back and forth between Binary and BCD. Carry gets set if the input was invalid.
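
A Python sketch of the two conversions, for reference. What ends up in A for an invalid input is not stated anywhere in this post, so the model simply returns the input unchanged together with Carry = 1; that part is an assumption.

```python
def chb(a):
    """Binary 0x00-0x63 to BCD 0x00-0x99; Carry set on invalid input."""
    if a > 0x63:
        return a, 1  # assumption: A left unchanged when input is out of range
    return ((a // 10) << 4) | (a % 10), 0


def cbh(bcd):
    """BCD 0x00-0x99 to binary 0x00-0x63; Carry set on invalid input."""
    hi, lo = bcd >> 4, bcd & 0x0F
    if hi > 9 or lo > 9:
        return bcd, 1  # assumption: A left unchanged for a non-BCD nibble
    return hi * 10 + lo, 0
```

Both functions return `(result_in_A, carry)`.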

RND, not much to explain either. the CPU has a built-in LFSR that is ALWAYS running as long as there is a Clock signal connected to the CPU, it completely ignores anything else going on in the CPU like Resets, RDY being pulled low, etc.
it's not really intended to be used as an RNG itself, but it can be useful to give a seed for an actual RNG Function.
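
For anyone unfamiliar with LFSRs, here is a generic Python sketch. The width and taps of the LFSR in the core are not given in this post, so this uses a common 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) purely as a stand-in; `lfsr_step`, `rnd`, and `period` are names made up for this example.

```python
def lfsr_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one clock."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def rnd(state):
    """What an RND-style instruction could return: the low byte of the LFSR."""
    return state & 0xFF


def period(seed=0xACE1):
    """Count clocks until the LFSR returns to its starting state."""
    state, count = lfsr_step(seed), 1
    while state != seed:
        state, count = lfsr_step(state), count + 1
    return count
```

Because the register steps on every clock regardless of what the program does, the value read depends on exact timing, which is what makes it usable as a seed.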

PHR, and PLR, not much to say i just thought these would be useful for Interrupt Routines or debugging something.

WAI, and STP, both of these function similarly to their 65C02 versions. notable differences:
WAI doesn't continue execution after itself if it received an IRQ while the I Flag is set, instead it will just ignore it.
also because my Aborts actually cancel the current instruction they can be used to escape both a WAI and an STP.


and finally, the actual Files.

EDIT: I uploaded everything Important to a Github Repo, everything can be downloaded from there: https://github.com/ProxyPlayerHD/65CE02-Softcore
keep in mind that i'm horrible at writing documentation so a lot of the files just kinda exist without explanation.
and also note that there could still be bugs hiding inside the CPU and i will take some time to iron those out (specifically Interrupts and edge cases).

and my workflow still consists of just using a Logic Simulator to generate Verilog code for me instead of writing it myself.
as for the actual usable Verilog Files, they are in the "Digital" folder. "CE_M_CPU_TOP.v" is just the CPU with a bidirectional Data Bus, while "CE_M_CPU.v" is the same but with a separate Data Input and Output (for when the CPU is not the Top Module).

on my Cyclone II the entire CPU takes up around 2700 Logic Elements. which is pretty massive...
I plan on actually building a Computer around this CPU and thought to use an iCE40HX 4K FPGA, with around 3500 Logic Elements it should be enough for the CPU, a VGA Controller, and some extra Logic.
problem is, the Lattice Synthesis Engine doesn't like the CPU and just crashes when trying to synthesize it. it works fine when i use other circuits that i've made with the Logic Simulator, and i can also synthesize each individual part of the CPU (ALU, Control Unit, Registers) fine, but when they are all together it just won't work. it gives me a cryptic error message too:
Code:
Done: error code -1073741819

i only found 2 matches on google and neither were really helpful.
I'll probably try to talk to the Dev who made Digital (Logic Simulator) to see if maybe he has an idea what is going on. if not i'd either have to go with another FPGA (Xilinx or Intel) or contact Lattice directly and ask them to fix their outdated software... (they likely won't).

.

anyways next up i'll probably try to either make an Interface with the FPGA's external SRAM or write a more thorough testing program that goes through all Instructions and also includes Interrupts and such.

but that is for later, right now i just want to finally share the progress i have made so far. and obviously none of this is locked down or anything. if anyone wants to rewrite the whole thing in proper Verilog or just make their own core with a similar idea, then go ahead!
While i doubt anyone is ever gonna use this core in an actual project (besides myself) i at least hope that it inspires some people to make their own extended cores.
an extended 65816 in the style of what i have done here would be interesting. with things like a Supervisor/User Mode, 24-bit Program Counter Option, moveable stack/direct bank, etc.
then again at that point you might as well just implement a 65k softcore.

and on a little side note while working on this i also came up with some other circuits, one of which is a "Programmable Wait State Generator". I haven't seen anything similar to it on this forum so if anyone is interested i can make a new thread about that. here a summary: https://pastebin.com/J66WgDnH
another circuit i came up with was a complete Tile or Bitmap based VGA Controller that barely fits into a single ATF1508 CPLD.

welp, that is the entire post. i don't know how to end these things lol. tell me your thoughts, ideas, things i've missed about the CPU, etc.


Last edited by Proxy on Mon Jun 14, 2021 6:15 pm, edited 2 times in total.

Post subject: Re: Extended 65CE02 Core
Posted: Sun Jun 13, 2021 8:47 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Thanks for the update!

Just one thought about a multi-cycle divide, or long-running instructions generally, is that they could impact interrupt response. Which might be worth thinking about. Or, at a high enough clock frequency, that might be little enough time that it's no problem.

You might like to try the open source toolchain for lattice (IceStorm) - it might be less crashy, and if it does reject your verilog, it might say why.

I have some comments which aren't really about your milestone, but about the best way to share it.

I think it would be better if you could share your project sources without all the tools included - at least, it should be much smaller. Ideally, upload your zip as a github or gitlab project.

For example, you've included the executables for
https://github.com/hlorenzi/customasm
and Logisim-Evolution and Programs/Digital

You've also included a couple of large PDFs which are already available on the web: 65CE02.pdf (12 pages) and C65 System Specifications Preliminary 1991.pdf (332 pages)

So, if you can omit all that, you'd only be sharing 5Mbyte and could probably attach it to your post here on the forum. It will also make it easier for people to study and comment on your work.

Your wait state idea could be interesting: please share that in a new thread. Best to just put the information in the post, rather than indirecting through a text sharing site.


Post subject: Re: Extended 65CE02 Core
Posted: Mon Jun 14, 2021 4:20 am

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
Yea after work I'll make a GitHub repository and upload everything unique onto there, everything else like the programs I'll link in the README.md

And I thought using Pastebin would be better because I didn't want the post to be too long to scroll through.
I'll edit that after work as well.


Post subject: Re: Extended 65CE02 Core
Posted: Mon Jun 14, 2021 11:29 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Sounds great! There is a bit of a tradeoff to be made about very long posts, but I think there's great merit in having materials close at hand. Also, this forum is very good at preserving very old materials, whereas many images and attachments placed at third party sites have been lost.


Post subject: Re: Extended 65CE02 Core
Posted: Mon Jun 14, 2021 6:38 pm

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
so, it's done! (the edits not the CPU, sadly)

now to actually comment on what you said about the Division... yea i kinda knew that already but i didn't really think about it. i thought "this state machine idea is better than a software routine because it's atomic and cannot be interrupted" but rewording that to "bad interrupt latency" makes me realize how bad of an idea that would be.
Assuming 1 iteration can be done per clock cycle, the longest the state machine would need would be ~85 cycles for 255 / 3, because dividing by 0, 1, and 2 (or any power of 2) can be optimized away easily.
I also thought about using an external SPI Flash to store all pre-calculated values, but the time it would take to access the Chip and read a single byte from it would already take longer than the worst case for the state machine.
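
The ~85 cycle figure can be checked with a short Python sketch of the subtraction loop. It assumes one subtraction per cycle and that zero and power-of-two divisors are handled by a shortcut outside the loop, as described above; `divide_by_subtraction` and `worst_case` are hypothetical names.

```python
def divide_by_subtraction(dividend, divisor):
    """Naive division: subtract the divisor until the remainder is smaller."""
    if divisor == 0:
        raise ValueError("zero divisor is handled outside the loop")
    quotient, remainder, iterations = 0, dividend, 0
    while remainder >= divisor:
        remainder -= divisor
        quotient += 1
        iterations += 1
    return quotient, remainder, iterations


def worst_case():
    """Largest iteration count over all 8-bit inputs, skipping powers of two."""
    worst = (0, 0, 0)  # (iterations, dividend, divisor)
    for divisor in range(1, 256):
        if divisor & (divisor - 1) == 0:
            continue  # 1, 2, 4, ... are assumed to be shortcut by shifting
        for dividend in range(256):
            iterations = divide_by_subtraction(dividend, divisor)[2]
            worst = max(worst, (iterations, dividend, divisor))
    return worst
```

`worst_case()` reports the iteration count together with the dividend and divisor that produce it.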

so the options are:
  1. use a cheap and Atomic State Machine and sacrifice Interrupt response time. (by as much as ~90 cycles)
  2. use a more expensive non-Atomic State Machine (ie every interrupt turns into an ABT when the state machine is active).
  3. throw away the State Machine and build a real faster Divider Circuit that will take up even more resources on the FPGA
  4. throw away Division completely and replace it with Signed Multiplication

and i'll try IceStorm sometime later, maybe it will give me better results.


Post subject: Re: Extended 65CE02 Core
Posted: Mon Jun 14, 2021 8:20 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Lots of choices there! I think that's the nature of the beast. Some architectures will abandon a long instruction and restart it, some will store the intermediate state somehow, some will put up with a long wait.

The state of the art with division seems to be, if you have a fast multiplier, to use multiplication and an iterative convergence. (The actual state of the art also tries very hard to pick a good starting value and to stop as soon as possible, but that's both large and complex, and also difficult to get right.)


Post subject: Re: Extended 65CE02 Core
Posted: Mon Jun 14, 2021 9:02 pm

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 293
Proper division is much faster, and with a bit of clever register re-use can be done with fewer resources.
There are 3 registers, which I will call X, Y, and Z. Start with X = 0, Y = dividend, Z = divisor (Z never changes, so it might not need a register of its own). Then repeat N times:
* shift XY left one bit, shifting in a 0
* If X-Z is not negative, write X-Z to X and set bit 0 of Y.
That's it. At the end, Y will contain the quotient and X the remainder.
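
The steps above can be sketched in Python as follows, taking N to be the operand width (8 here). This is only a software model of the algorithm; the function name and the `bits` parameter are made up for this example.

```python
def shift_subtract_divide(dividend, divisor, bits=8):
    """Shift-and-subtract division; returns (quotient, remainder)."""
    x, y, z = 0, dividend, divisor
    mask = (1 << bits) - 1
    for _ in range(bits):
        # Shift the XY pair left one bit, shifting a 0 into Y.
        x = (x << 1) | (y >> (bits - 1))
        y = (y << 1) & mask
        # If X - Z is not negative, keep the difference and set bit 0 of Y.
        if x >= z:
            x -= z
            y |= 1
    return y, x
```

The loop always runs exactly `bits` times, so the time taken does not depend on the operands. There is no zero check: with a divisor of 0 this model returns a quotient of all ones and the dividend as the remainder.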


Post subject: Re: Extended 65CE02 Core
Posted: Tue Jun 15, 2021 12:34 am

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
One approach that I've not seen discussed as part of this thread is the implementation of these long instructions as "partial" implementations of the algorithms. In other words, when implementing the "mov" instruction for my M65C02A soft-core, I decided to implement it either as a non-interruptible block move or as a single cycle block move, which allows the instruction to be interrupted after each move cycle. In the non-interruptible mode of the instruction, the entire block was moved, and I assumed that the programmer understood that it would potentially introduce a significant amount of interrupt latency. In the single cycle mode, I made the instruction execute a single move operation, and leave the P register Z flag set. If the single cycle move instruction was followed by a nop and a bne instruction, then the number of cycles that the single cycle instruction would require would match that of the '816 mvn/mvp instructions.

In this manner, I allowed the M65C02A mov instruction to be interruptible without having to include significant additional complexity to store / restore the internal state of the instruction.

Just a suggestion.

_________________
Michael A.


Post subject: Re: Extended 65CE02 Core
Posted: Wed Jun 16, 2021 6:29 pm

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
Alright, i have decided to go with both 1. and 3.
i sat down and made a proper division circuit/state machine, but it will not be interruptible. (except for Abort and Reset obviously)
I used this site (plus John West's Post) to design the whole circuit: https://projectf.io/posts/division-in-verilog/
overall it was a lot simpler than i thought it would be. i was never able to find an actual example logic circuit online, just example calculations, pseudocode, and text descriptions of the steps...

anyone who uses Division Instructions has to be aware of the potential Interrupt Lag they can cause.
though it's only 8 cycles for any division (there is no divide by 0 check or anything, any operation in it will always take exactly 8 cycles), so the longest division instruction (DIV/MOD Absolute) would need a total of 13 Cycles.

but that could also become 14 cycles depending on how exactly i implement the Control Unit waiting for the State Machine to finish.
my idea was to use one of the empty places in my conditional micro-branch selection thingy.
basically i have a few Micro-Instruction that check for some bit being set (a flag or data bit from Memory), if it is set the Cycle counter is set to 5, if not it will increment normally. but the Demultiplexer used to select the bit to check is larger than it actually needs to be, so without much effort i could add a bit for the State Machine, so the Control Unit would have a Conditional check at Cycle 5 and if that bit is set (ie the State Machine is still working) then it will set the Cycle counter to 5 again, making it loop for as long as the State machine is running.

i use the same exact thing for the WAI Instruction btw, i just have it conditionally jump to the same cycle over and over until an Interrupt is caught and it can break free of the loop to finish the WAI Instruction, which in turn starts the Interrupt Sequence.

I'll post an image of the current circuit i have, but i won't go into much detail as i want to completely redo it cleaner for the CPU. but the register names and Labels (X/Y/A/Q) are identical to the site i linked so if anyone really wanted to they could probably piece the circuit's function together.
Attachment: javaw_2021-06-16_20-23-26.png [ 158.1 KiB ]


Post subject: Re: Extended 65CE02 Core
Posted: Wed Jun 16, 2021 6:49 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
14 cycles doesn't seem so bad - especially if we're talking about running faster than 1MHz. Usually interrupt latency is caused in cycles but experienced in nanoseconds.

I note the 68000 has a worst case latency - even with a zero wait state memory - of 378 cycles. That's because MOVEM is sensitive at the start of the instruction, and DIVS at the end. And the interrupt acknowledgement sequence takes 58 cycles itself. (See page 2 of the pdf)

Attachment: AN1012-mc68k-excerpt-interrupt-latency.png [ 155.44 KiB ]
File comment: excerpt from Motorola App Note AN1012


Last edited by BigEd on Wed Jun 16, 2021 8:49 pm, edited 1 time in total.

Post subject: Re: Extended 65CE02 Core
Posted: Sat Jun 19, 2021 11:06 am

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
alright Division and Modulo are now finally implemented, and run on the FPGA at 25MHz pretty well! (though again there is still a lot of testing to do at those high speeds)
the total amount of Logic Elements didn't go up by a lot either, it's still just below 2700 LE's.
i hopefully did the timing correctly. i just had a cycle counter going while single-stepping my test program and looked at how many cycles it was when it fetched the opcode and when it fetched the next Opcode to calculate how many cycles the Instruction took. here are the times (they are the same for DIV and MOD so i'll only show one):
Code:
    N V S B D I Z C
    + - - - - - + +
   
    Opcode  Addressing      Mnemonic        Bytes   Cycles
    -----------------------------------------------------
    5C 49   Immediate       DIV #nn         2+1     11+1
    5C 45   Base Page       DIV nn          2+1     12+1
    5C 55   Base Page X     DIV nn,X        2+1     12+1
    5C 4D   Absolute        DIV nnnn        3+1     13+1
    5C 5D   Absolute X      DIV nnnn,X      3+1     13+1
    5C 59   Absolute Y      DIV nnnn,Y      3+1     13+1

Assuming an Interrupt coming in during the Opcode Fetch of the longest variant of the instruction it would take around 14 cycles for the Interrupt to be handled, which at perfect 25MHz Operation is 560ns (0.56µs). even at the plan-B speed of 12.5MHz it would still just be ~1.12µs, and i think any user should be able to handle their keyboard/IO stuff taking around a microsecond longer than usual, especially since modern PCs deal with Input lag in the milliseconds.
i also cannot really think of anything timing critical that couldn't spare a microsecond, and even if there was something you can just use an Abort to really make sure the CPU makes it in time.
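
The latency numbers are just cycles divided by clock frequency. As a quick Python check (the function name is made up for this example):

```python
def latency_ns(cycles, clock_hz):
    """Convert a cycle count into nanoseconds at a given clock frequency."""
    return cycles * 1e9 / clock_hz


worst_at_25mhz = latency_ns(14, 25_000_000)    # 560 ns
worst_at_12_5mhz = latency_ns(14, 12_500_000)  # 1120 ns, about 1.12 µs
```

The same function can be used to see what the 14 cycle worst case means at other clock speeds, for example 1MHz.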

speaking of interrupts, i also changed some things about their Logic: NMI and ABT are now sampled at the falling edge of PHI2 instead of being asynchronous, as that could otherwise cause some timing issues.

anyways, division and Interrupt latency aside, what about all the other features of the CPU?
the new addressing modes like the Stack indexed and PC Relative Indexed ones for example.
my attempt to complement the 65CE02's Instruction set by adding the Right Shifts/Rotate to its own left Shifts/Rotate.
or just the new instructions in general. for example i'm not entirely sure about keeping CHB and CBH since BCD doesn't really seem to be a very commonly used feature of any 65xx CPU.


 Post subject: Re: Extended 65CE02 Core
PostPosted: Thu Jun 24, 2021 12:33 pm 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3346
Location: Ontario, Canada
Proxy wrote:
note that both Bytes and Cycles have a +1 to them because of the Prefix Opcode being executed before the actual extended Instruction
I hope you've remembered to account for the case of an interrupt arriving after the prefix and before the instruction which the prefix affects. You need to either disallow interrupts during this brief time or else somehow remember the fact that a prefix occurred in order to ensure its effect is not ignored after the ISR completes and normal processing resumes.

Two other suggestions. Have you considered a block-move instruction? Probably your 8-bit registers will limit you to moves of 256 bytes or less, but if necessary the user could put the block-move inside a loop. The overall speed would still be impressively high.

Finally, what about some extra address lines -- something to break the 64K barrier? There are zillions of different ways this could work, and it is NOT necessary to limit yourself to the stilted tradition of working through a "window" between $8000 and $C000 (for example).

It seems to me the main benefit of additional address space is to meet the need for data (which can balloon in a way that the need for code space does not). Data logging and video buffering are two example applications; another is large lookup tables, which can be used to simulate instructions such as multiply. Memory is cheap nowadays. I'd say you're missing an opportunity if you don't make plenty of it accessible!

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


 Post subject: Re: Extended 65CE02 Core
PostPosted: Thu Jun 24, 2021 3:00 pm 
Online
User avatar

Joined: Fri Aug 03, 2018 8:52 am
Posts: 745
Location: Germany
Dr Jefyll wrote:
Proxy wrote:
note that both Bytes and Cycles have a +1 to them because of the Prefix Opcode being executed before the actual extended Instruction
I hope you've remembered to account for the case of an interrupt arriving after the prefix and before the instruction which the prefix affects. You need to either disallow interrupts during this brief time or else somehow remember the fact that a prefix occurred in order to ensure its effect is not ignored after the ISR completes and normal processing resumes.

Yep, i mentioned it on the GitHub page in the "INST_Indexed" file, but i don't blame you for not seeing that. i'm very messy with where i put information like that.
Basically the AUG Instruction gets fetched and sets the "OP2" FlipFlop inside the Control Unit. that FlipFlop controls whether the next instruction being fetched executes from the Main or Extended Opcode Table. it is also responsible for suppressing any IRQ and NMI until it is cleared. ABT and RES still work like normal.
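
The OP2 behaviour described above can be sketched as a toy fetch loop in Python. This is only an illustration, not the core's actual logic: every instruction is treated as a single byte with no operands, taking an interrupt is just logged instead of jumping to a handler, and the names `run` and `irq_at` are invented for the example. The prefix value 5C is taken from the DIV/MOD table.

```python
AUG = 0x5C  # prefix opcode, as shown in the DIV/MOD table

def run(program, irq_at=None):
    """Toy fetch loop, returns a list of events.

    program: list of opcode bytes (operands are ignored in this sketch).
    irq_at:  index of the fetch during which an IRQ arrives (test input).
    """
    events = []
    op2 = False          # models the "OP2" flip-flop
    irq_pending = False
    for i, opcode in enumerate(program):
        if irq_at == i:
            irq_pending = True
        # IRQ/NMI are only taken while OP2 is clear
        if irq_pending and not op2:
            events.append("IRQ")
            irq_pending = False
        if op2:
            events.append(("EXT", opcode))   # decode from the extended table
            op2 = False
        elif opcode == AUG:
            op2 = True                       # next opcode is an extended one
        else:
            events.append(("MAIN", opcode))  # decode from the main table
    return events

# IRQ arrives right after the prefix: it is held off until the
# extended instruction has been executed.
print(run([0xEA, AUG, 0x49, 0xEA], irq_at=2))
```

With the IRQ arriving at fetch index 2 (right after the prefix), the extended opcode is decoded first and the IRQ is only logged before the following instruction.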

Dr Jefyll wrote:
Two other suggestions. Have you considered a block-move instruction? Probably your 8-bit registers will limit you to moves of 256 bytes or less, but if necessary the user could put the block-move inside a loop. The overall speed would still be impressively high.

hmm, i could do something similar to my Division Circuit and just have a separate State Machine that does all of the work and have the Control Unit run in a loop until the State Machine is done. it would basically be a built-in DMA Controller that activates via some custom Instruction. but how would Interrupts work with it? i could pause the State Machine when an Interrupt occurs but how would it know when to continue? once it returns from the Interrupt it would just execute the next instruction (unless it's an abort). and tracking RTI Instructions wouldn't be worth it for the theoretically infinite amount of nested Interrupts.
i could have the DMAC modify the PC being pushed to the stack, so that it always goes back to the DMA Instruction, and then have the Instruction itself check if the DMAC was interrupted: if yes just continue the loop, if not load its registers and then start the loop.
either way it doesn't seem like a lot of extra logic compared to the Division State Machine (difference being that this DMAC has to take over the CPU's Address/Data Bus and some Control Signals, while the Division State Machine is basically just a black box)
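
The restart idea (push a PC that points back at the DMA instruction, then let the instruction decide between "resume" and "fresh start") could look roughly like this in Python. Everything here is hypothetical: the `Dmac` fields, the operand format, `INSTRUCTION_LENGTH` and the `irq_pending` callback are invented for the sketch, and nothing in it reflects actual hardware.

```python
INSTRUCTION_LENGTH = 2   # hypothetical: prefix + opcode

class Dmac:
    """Hypothetical built-in DMA state machine."""
    def __init__(self):
        self.active = False          # True while a transfer is unfinished
        self.src = self.dst = self.count = 0

def dma_instruction(dmac, mem, pc, operands, irq_pending):
    """Execute or resume the DMA instruction located at `pc`.

    Returns the PC to continue from. On an interrupt this is `pc` itself,
    i.e. the value that would be pushed so RTI lands on the instruction again.
    """
    if not dmac.active:
        # fresh start: load the DMAC registers
        dmac.src, dmac.dst, dmac.count = operands
        dmac.active = True
    # otherwise the DMAC was interrupted earlier: just continue the loop
    while dmac.count > 0:
        if irq_pending():
            return pc
        mem[dmac.dst] = mem[dmac.src]
        dmac.src = (dmac.src + 1) & 0xFFFF
        dmac.dst = (dmac.dst + 1) & 0xFFFF
        dmac.count -= 1
    dmac.active = False
    return pc + INSTRUCTION_LENGTH
```

One thing the sketch does not handle is an interrupt handler that uses the DMA instruction itself, since that would overwrite the state of the paused transfer.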

Dr Jefyll wrote:
Finally, what about some extra address lines -- something to break the 64K barrier? There are zillions of different ways this could work, and it is NOT necessary to limit yourself to the stilted tradition of working through a "window" between $8000 and $C000 (for example).

It seems to me the main benefit of additional address space is to meet the need for data (which can balloon in a way that the need for code space does not). Data logging and video buffering are two example applications; another is large lookup tables, which can be used to simulate instructions such as multiply. Memory is cheap nowadays. I'd say you're missing an opportunity if you don't make plenty of it accessible!

well my idea was to just have people use the extra pins to extend the Address Range. as said in the main post you can for example use VSA with some simple logic gate ICs to separate the stack from the rest of Memory, giving the Stack its own 64k of RAM. same with VPA and VDA, both of them allow you to separate Data and Program Memory, so you can build an MMU that allows for separate pages for each type of Memory (Program, Data, Base Page, and Stack). it won't technically get you over the 64k limit but it would be a much easier and faster option compared to having the CPU do it internally.
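
A rough Python sketch of that kind of external decode: the status pins pick one 64k region and the 16-bit CPU address selects the byte inside it. The region numbering, the priority between pins, and the function name are all assumptions made for the example; how the pins combine on a given cycle depends on the core.

```python
def physical_address(addr16, vpa, vda, vsa):
    """Map a 16-bit CPU address plus status pins to a wider external address.

    Hypothetical layout: two extra address bits select one 64k region.
    Returns None when no pin marks the address as valid.
    """
    if vsa:
        region = 2      # stack gets its own 64k
    elif vda:
        region = 1      # data
    elif vpa:
        region = 0      # program
    else:
        return None     # no valid address on this cycle
    return (region << 16) | (addr16 & 0xFFFF)

print(hex(physical_address(0x01FF, vpa=False, vda=False, vsa=True)))  # 0x201ff
```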

if i were to add a larger address range i would probably go the 65816 route, but with 4 (maybe 5) extra Registers for the upper 8 bits of the address bus, instead of just 2.
Program Bank Register, Data Bank Register, Base/Direct Page Bank Register, Stack Bank Register, and maybe Vector Bank Register too.
but like the 65816 i would need to add a mode switch bit somewhere to turn these extra registers on/off as things like Interrupts and some Subroutine Calls would have to push the PBR as well, breaking compatibility.
i would also need extra 24bit addressing modes which would require most of the Address ALU to be rebuilt. and knowing myself i would likely also add an extra mode bit (that i would love to have on the 65816) that changes the PBR to be updated by the PC over/underflowing, pretty much turning the entire 16MB Range into a Flat Memory Model, allowing programs to be larger than 64k without having to use Long Jumps and such.
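
The flat-mode bit boils down to letting the carry or borrow out of the 16-bit PC ripple into the PBR. A small Python sketch, where the function name and the `flat` parameter are invented for illustration:

```python
def advance_pc(pbr, pc, step=1, flat=False):
    """Advance the PC by `step`; in flat mode, over/underflow also updates the PBR."""
    total = pc + step
    new_pc = total & 0xFFFF
    if flat:
        # total >> 16 is +1 on overflow and -1 on underflow
        # (Python's >> floors for negative numbers)
        pbr = (pbr + (total >> 16)) & 0xFF
    return pbr, new_pc

print(advance_pc(0x01, 0xFFFF))             # (1, 0)  PC wraps inside the bank
print(advance_pc(0x01, 0xFFFF, flat=True))  # (2, 0)  PC carries into the PBR
```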

overall it would be possible but it would also add quite a lot of Logic and probably require extra bits on the Control ROMs (which to be honest isn't that bad as there are only at most 256 unique values in those ROMs)
i do quite like the idea, but i'm just not completely sure about every detail when it comes to actually implementing it.


 Post subject: Re: Extended 65CE02 Core
PostPosted: Fri Jun 25, 2021 4:29 pm 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3346
Location: Ontario, Canada
Proxy wrote:
how would Interrupts work with [a block-move instruction]?
It's slower than DMA, but one approach would be to create an instruction that moves just one byte, and can automatically execute repeatedly (kind of like what the '816 does with MVP and MVN).

In the following example, the source and destination addresses are held in a combination of registers and zero-page. The usage is unusual. X and Y hold the least significant bytes of the addresses -- the address "low" or ADL bytes. The two ADH ("high") bytes are held in two consecutive bytes in z-pg. Each iteration would go like this:

  • Cycle 1: read the instruction Prefix.
  • Cycle 2: read the instruction Opcode.
  • Cycle 3: read the instruction operand. It points to the z-pg location of the two bytes. Let's say the operand is $42.
  • Cycle 4: read location $42. This gives us the ADH of the Source address.
  • Cycle 5: read a byte to be moved. The address is formed by ($42) concatenated with X. Increment X.
  • Optional Cycle 5a: if X overflowed (i.e. is now =0), increment the ADH value and write it back to $42.
  • Cycle 6: read location $43. This gives us the ADH of the Destination address.
  • Cycle 7: write the byte to be moved. The address is formed by ($43) concatenated with Y. Increment Y. Decrement Z (which holds the count). If Z <>0, don't advance PC. Instead, roll it back so it points to the Prefix again.
  • Optional Cycle 7a: if Y overflowed (i.e. is now =0), increment the ADH value and write it back to $43.
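
The cycle list above can be modelled in a few lines of Python to check the byte and cycle counts. This is a sketch under assumptions: memory is a flat 64k `bytearray`, registers live in a dict, the instruction is 3 bytes long (prefix, opcode, operand), the second pointer byte wraps within zero page, and `move_step` / `block_move` are invented names.

```python
INSTR_LEN = 3  # prefix + opcode + operand, per cycles 1-3

def move_step(mem, regs, zp):
    """Run one iteration of the proposed move instruction; return cycles used."""
    cycles = 3                                   # cycles 1-3: prefix, opcode, operand
    src_hi = mem[zp]                             # cycle 4: source ADH
    data = mem[(src_hi << 8) | regs["X"]]        # cycle 5: read the byte
    cycles += 2
    regs["X"] = (regs["X"] + 1) & 0xFF
    if regs["X"] == 0:                           # cycle 5a: X wrapped, bump source ADH
        mem[zp] = (src_hi + 1) & 0xFF
        cycles += 1
    dst_zp = (zp + 1) & 0xFF                     # assumption: wraps within zero page
    dst_hi = mem[dst_zp]                         # cycle 6: destination ADH
    mem[(dst_hi << 8) | regs["Y"]] = data        # cycle 7: write the byte
    cycles += 2
    regs["Y"] = (regs["Y"] + 1) & 0xFF
    regs["Z"] = (regs["Z"] - 1) & 0xFF
    if regs["Z"] == 0:
        regs["PC"] = (regs["PC"] + INSTR_LEN) & 0xFFFF   # finished: move on
    # otherwise PC stays on the Prefix, so the instruction is fetched again
    if regs["Y"] == 0:                           # cycle 7a: Y wrapped, bump dest ADH
        mem[dst_zp] = (dst_hi + 1) & 0xFF
        cycles += 1
    return cycles

def block_move(mem, regs, zp):
    """Repeat the instruction until PC moves past it; return total cycles."""
    start_pc = regs["PC"]
    total = 0
    while regs["PC"] == start_pc:
        # an interrupt could be serviced here, between two iterations
        total += move_step(mem, regs, zp)
    return total

mem = bytearray(65536)
mem[0x42], mem[0x43] = 0x20, 0x30                # source $20xx, destination $30xx
mem[0x20FE:0x2102] = bytes([1, 2, 3, 4])
regs = {"X": 0xFE, "Y": 0x00, "Z": 4, "PC": 0x1000}
print(block_move(mem, regs, 0x42), list(mem[0x3000:0x3004]))
```

Because all state lives in X, Y, Z and the two zero-page bytes, an interrupt between iterations needs no special handling: the pushed PC still points at the Prefix and the move simply resumes after RTI.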


Quote:
if i were to add a larger address range i would probably go the 65816 route, but with 4 (maybe 5) extra Registers for the upper 8 bits of the address bus, instead of just 2.
There are many, many ways this could work. You might want to have a look at my 1988 KK Computer, which uses a scheme that's fairly simple and general. There are four bank registers, called K0-K3. K1 to K3 can be invoked by the programmer at will. In other words, they're not tied to any specific function, such as Vector Bank Register. And K0 is the default -- it's what you get when you don't ask to invoke K1-K3.

I opted to keep zero-page and the stack in Bank 0 at all times, which means K0-K3 are ignored for these accesses. I'm not bothered by keeping stack and z-pg in Bank 0. The important thing (for me) is being able to address data arrays that exceed 64K. (The same 24-bit space can also contain code, but for me the need for code space is secondary compared to the need for data space.)
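
As a rough illustration of the scheme just described (not the KK Computer's actual logic), address formation could be modelled like this in Python. The 8-bit register width is inferred from the 24-bit space mentioned above, and the function and parameter names are invented.

```python
def effective_address(addr16, k, select=0, zp_or_stack=False):
    """Form a 24-bit address from a 16-bit address and a bank register.

    k:           list of four bank register values, K0..K3
    select:      which register the instruction invokes (0 = default K0)
    zp_or_stack: zero-page and stack accesses always go to Bank 0
    """
    bank = 0 if zp_or_stack else (k[select] & 0xFF)
    return (bank << 16) | (addr16 & 0xFFFF)

k = [0x00, 0x05, 0x06, 0x07]
print(hex(effective_address(0x1234, k, select=1)))  # 0x51234
```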

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Sat Jun 26, 2021 12:23 am, edited 1 time in total.

 Post subject: Re: Extended 65CE02 Core
PostPosted: Sat Jun 26, 2021 12:22 am 
Offline
User avatar

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Dr Jefyll:

Another elegant suggestion. It effectively uses some of the best attributes of the 6502 architecture, and easily limits the interrupt latency without undue additional complexity in the core. Not quite as elegant as your suggestion to me to incorporate the Forth VM registers directly in my core, but still very nice and simple to implement. :)

Since I extended all the registers to 16 bits in my core, I did not need to load the upper byte of the address like your suggestion requires. My core's mov instruction has a similar single cycle transfer mode which limits the interrupt latency to 3-5 clock cycles. Trying to make the block move mode of my core's mov instruction interruptible proved to require a lot of additional internal complexity in my core that I did not want to add.

_________________
Michael A.

