Sorry I haven't posted on this thread in a while.
I've been working on other projects, like trying to get CC65 to behave and trying to get my ATF1508-based VGA Card to work.
So far both projects have been failures; if I'm stuck on either for much longer I might just open a thread about them.
Even though I haven't really been working on this core (which I've now renamed "65CE816", or just "E816" for short), I have been making some plans for where I want to take this thing.
To start, the X, Y, and Z Index Registers will be extended to 16 bits to make working with data structures larger than 256 bytes less painful.
To retain compatibility, though, the high byte of each Index Register will only be affected by special instructions. So, for example, if X contains 0x00FF and the CPU executes the regular "INX" Instruction, the result will be 0x0000 (i.e. the low byte doesn't overflow into the high byte).
There will be a total of 6 extra instructions for each Index Register: 16-bit increment/decrement, 16-bit compares (Immediate, Basepage, and Absolute), and swapping the high and low bytes.
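To make the compatibility rule concrete, here is a small Python model of one 16-bit Index Register. The legacy `inx` follows the behaviour described above (only the low byte wraps), while `inx16`, `dex16`, `cpx16` and `swap_bytes` stand in for the planned extra instructions. Those names are hypothetical, and having the 16-bit compare set N/Z/C the way a 6502 CPX does is an assumption.

```python
MASK8 = 0xFF
MASK16 = 0xFFFF


def inx(x: int) -> int:
    """Legacy INX: increment the low byte only; the high byte is untouched."""
    return (x & 0xFF00) | ((x + 1) & MASK8)


def inx16(x: int) -> int:
    """Hypothetical 16-bit increment: carries into the high byte."""
    return (x + 1) & MASK16


def dex16(x: int) -> int:
    """Hypothetical 16-bit decrement: borrows from the high byte."""
    return (x - 1) & MASK16


def swap_bytes(x: int) -> int:
    """Hypothetical high/low byte swap."""
    return ((x & MASK8) << 8) | ((x >> 8) & MASK8)


def cpx16(x: int, operand: int) -> tuple[bool, bool, bool]:
    """Hypothetical 16-bit compare; returns (N, Z, C) like a 6502 CPX."""
    diff = (x - operand) & MASK16
    return bool(diff & 0x8000), diff == 0, x >= operand


print(hex(inx(0x00FF)))    # 0x0 (low byte wrapped, high byte unchanged)
print(hex(inx16(0x00FF)))  # 0x100
```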
Another benefit of having more internal registers is that I can redo the MOV Instruction to require fewer operands.
For example, the X Register could contain the lower 16 bits of the source address, while the Y Register holds the destination address. The Z Register could hold the number of bytes left to be MOVed, and the upper 8 bits of the source/destination addresses can be stored on the Stack.
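A rough sketch of why this register layout is interrupt-friendly: if each step copies one byte and immediately updates X, Y and Z, the whole transfer state is visible at every instruction boundary, so an Interrupt only has to preserve it. The model assumes an ascending, byte-at-a-time copy and that X/Y wrap within their 64k bank (as with the 65816's MVN); none of that is settled in the plan above, and `MovState`/`mov_step` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass
class MovState:
    x: int         # low 16 bits of the source address
    y: int         # low 16 bits of the destination address
    z: int         # bytes left to move
    src_bank: int  # upper 8 bits of the source (kept on the Stack in the plan)
    dst_bank: int  # upper 8 bits of the destination (also on the Stack)


def mov_step(mem: bytearray, s: MovState) -> bool:
    """Copy one byte and update the registers; return True if bytes remain."""
    if s.z == 0:
        return False
    mem[(s.dst_bank << 16) | s.y] = mem[(s.src_bank << 16) | s.x]
    s.x = (s.x + 1) & 0xFFFF  # assumption: wrap within the bank
    s.y = (s.y + 1) & 0xFFFF
    s.z -= 1
    return s.z != 0


mem = bytearray(1 << 24)              # full 24-bit address space
mem[0x010000:0x010005] = b"hello"
state = MovState(x=0x0000, y=0x0100, z=5, src_bank=0x01, dst_bank=0x02)
mov_step(mem, state)
mov_step(mem, state)
saved = replace(state)                # all an Interrupt needs to preserve
while mov_step(mem, saved):
    pass
print(bytes(mem[0x020100:0x020105]))  # b'hello'
```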
Next up, I think I'll be removing the MUL/DIV state machines.
Specifically, I've been thinking about Dr Jefyll's advice of implementing these multi-cycle ALU Instructions as "partial" Instructions.
So instead of a MUL/DIV Instruction doing all 8 steps of the algorithm internally, it would only do a single step of the algorithm per instruction... so you have to execute the same instruction 8 times in a row to get your final result.
The reason I have been thinking about doing it like that is that I now have 3 extra 8-bit registers to play with (the upper 8 bits of X, Y, and Z). They let me store the intermediate results of each algorithm step in user-accessible registers, which in turn allows Interrupts to occur between MUL/DIV Instructions without breaking anything.
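For reference, one common way to split an 8x8 multiply into eight identical steps is the right-shifting shift-and-add method sketched below. The 16-bit product register starts with the multiplier in its low byte and zero in its high byte; each step tests the lowest bit, conditionally adds the multiplicand into the high byte, and shifts right, with the 17th bit acting as the carry flag. Which Index Register high bytes would hold which value isn't specified above, so `mul_step` is a hypothetical, register-agnostic model.

```python
def mul_step(product: int, multiplicand: int) -> int:
    """One partial MUL step on a 16-bit product/multiplier register."""
    if product & 1:
        product += (multiplicand & 0xFF) << 8  # may carry into bit 16
    return product >> 1                        # that carry shifts into bit 15


def mul8(a: int, b: int) -> int:
    """Run the step eight times, as software would with eight MUL instructions."""
    product = b & 0xFF        # multiplier in the low byte, high byte cleared
    for _ in range(8):
        product = mul_step(product, a)
        # an Interrupt could be taken here: all state is in `product` and `a`
    return product


print(mul8(3, 5))       # 15
print(mul8(255, 255))   # 65025
```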
This would also remove all addressing modes, since I want the algorithms to work entirely on internal registers to avoid constantly accessing memory and wasting cycles re-reading the same data 8 times in a row.
Ultimately this would leave me with 2 opcodes out of the original 24: MUL and DIV. Both of them would output the full 16-bit result instead of having separate instructions for each half like before (MUL/MLH and MOD/DIV).
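Division can be chopped up the same way. The sketch below uses restoring division of an 8-bit dividend by an 8-bit divisor, so eight steps leave the quotient in the low byte and the remainder in the high byte of one 16-bit result. That operand width, and treating a zero divisor as an error, are assumptions; the plan doesn't say what DIV will do in either case.

```python
def div_step(acc: int, divisor: int) -> int:
    """One restoring-division step on a 16-bit remainder:quotient register."""
    acc <<= 1                              # shift the next dividend bit into the high byte
    if (acc >> 8) >= divisor:              # trial subtraction succeeds...
        acc = (acc - (divisor << 8)) | 1   # ...so keep it and set a quotient bit
    return acc


def div8(dividend: int, divisor: int) -> tuple[int, int]:
    """Run eight steps; return (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    acc = dividend & 0xFF                  # remainder starts at 0 in the high byte
    for _ in range(8):
        acc = div_step(acc, divisor & 0xFF)
    return acc & 0xFF, acc >> 8


print(div8(7, 2))      # (3, 1)
print(div8(200, 13))   # (15, 5)
```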
Maybe I'll also add signed versions of MUL and DIV in the future if the need for them is large enough.
Overall this should reduce the complexity of the ALU while also decreasing Interrupt latency, at the expense of the user having to manually set up all the registers before doing any multiplication/division.
I'm also planning on redoing the Interrupt logic (again) and cleaning it up a bit. Having ABORT work as it does right now (i.e. actually cutting off the current Instruction) causes a lot of edge cases, and I'm very tempted to just go the 65816 route and have it wait for the current instruction to finish. That makes the whole logic around it way simpler, at the expense of ABORT being as slow as a normal Interrupt, like on the 65816.
That shouldn't be much of an issue, as this processor is intended to run at ~25MHz.
Sheep64 wrote:
I knew you were working on something impressive but I didn't expect even a preliminary version before Dec 2021. Indeed, I expected progress from randyhyde in Aug 2021 and long before you. Your work is simultaneously modest (one prefix) and bold (multiply, divide and modulo).
Thank you. I specifically made this thread after I had a working prototype, to avoid having plans upon plans without anything solid ever coming out.
Sheep64 wrote:
With FPGA, you have the luxury of inventing your own signals. Separate read and write signals would be very welcome. This would skip the typical three NAND gate qualified write logic of many 6502/65816 boards. It would also aid RC2014 or RC6502 compatibility.
Hmm, I guess I could add an inverted version of the RW output. That way you could connect one of them to the WE pin on your memory and the other to the OE pin, and it would still be compatible with a regular 65Cxx system.
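As a sketch of that wiring, the function below derives active-low /OE and /WE from RW and its proposed inverted twin. Note that on most 65xx boards the write strobe is additionally gated with PHI2, because the address and data buses are still settling during the first half of the cycle; the optional `phi2` argument models that extra qualification. The function name and single-gate model are illustrative only.

```python
from typing import Optional, Tuple


def memory_strobes(rw: int, phi2: Optional[int] = None) -> Tuple[int, int]:
    """Return (oe_n, we_n); both are active low.

    rw follows the 65xx convention: 1 = read, 0 = write.
    With phi2=None this is the bare RW / inverted-RW wiring; passing phi2
    additionally gates the write strobe with the clock.
    """
    oe_n = 1 - rw          # inverted RW: low during reads
    we_n = rw              # RW itself: low during writes
    if phi2 is not None and not phi2:
        we_n = 1           # no write strobe while PHI2 is low
    return oe_n, we_n


for rw in (1, 0):
    print(rw, memory_strobes(rw), memory_strobes(rw, phi2=0), memory_strobes(rw, phi2=1))
```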
Sheep64 wrote:
Indeed, if you have opcodes spare, it may be worthwhile to consider one page I/O instructions. These would work like zp,X but for an I/O segment. With or without a Z80 style /IOREQ, you may be able to simplify your signals.
I've got some mixed feelings about IO Instructions. On one hand they are basically just more limited memory instructions with a pin on the CPU telling the system that it's IO, but on the other hand they make it less likely for programs to accidentally write to IO when they bug out/crash.
Hmm... something to think about, I guess.
Sheep64 wrote:
Primarily, this would be "Program Segment", "Data Segment" and "Other Segment" where A8 and A15 may be used to determine Direct Page, Stack, I/O Segment or Vectors. In the trivial case, these would be the first two or last two pages. At the very least, you'll make drogon happy given that the Ruby anniversary computer also uses page $FE for I/O.
I'm not really sure what you mean by this.
While you can relocate the Base (Direct) Page and the Stack, there is no reason to limit them to some specific page in Bank 0, as (unlike on the 65816) you can place them anywhere within the 24-bit memory range.
Honestly, I recommend allocating 1 bank per segment: that's 64k for the Base Page, 64k for the Stack (with the SP in 16-bit mode), and the Vector Table is immovable, so that just stays in Bank 0.
Plus, this is exactly why I added the extra V*A pins, so you can see whether the CPU is accessing the Base Page, Stack, Vector Table, etc. without having to keep track of where they are located in memory.
Sheep64 wrote:
aleferri solved atomic operations on 6502 in May 2021. The solution is quite simple. One or more of the read-modify-write instructions should conditionally write. With your prefix instruction, you could prefix all read-modify-write instructions and make the write phase conditional for all of them. This simplistic arrangement has the benefit that it operates in a fixed number of clock cycles. Therefore, don't try to optimize the last bus cycle of atomic operations.
I'm confused... I haven't noticed anything wrong with RMW Instructions, so what exactly got "solved", and why does the solution involve optionally throwing away the result of the Instruction?
Also, a lot of the prefixed opcodes that correspond to the existing RMW Instructions are already used; for example, the prefixed versions of "Increment" and "Decrement" are "Increment with Carry" and "Decrement with Carry".
Sheep64 wrote:
S+X stack indexing requires addition in some form. Personally, I'd maintain S+X as a hidden register. If you want to implement this more frugally, don't be concerned about an additional bus cycle for ALU operation because the ALU cycles may overlap with opcode fetches. When prefix is fetched, transfer S to a hidden register. When an instruction is fetched, add or transfer X to a hidden register. In either case, X or S+X is ready for use. Unfortunately, this speculative execution might interact badly with interrupts. FPGA ALU operation may also consume more energy than a mux for S+X.
S+X has been changed to S+Y, since CC65 likes to use A and X for 16-bit values and the Y Register doesn't get much love.
Also, my core has a separate ALU for address calculations. The control unit just tells the Address ALU which addressing mode it wants and, ta-da, the address is on the output, with no extra cycles or registers required.
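To illustrate the idea of a dedicated address adder, here is a small model: given the addressing mode chosen by the control unit, the operand and the current registers, it returns the effective address in one evaluation. Apart from the pointer fetch in the indirect mode (a bus cycle), each case is a single addition. The mode names, the flat 16-bit address space (banks ignored), and reading the stack-indirect mode as "fetch a pointer from SP+d, then add Y" (the way the 65816's (sr,S),Y works) are all assumptions.

```python
from typing import Callable, Dict, Optional

MASK16 = 0xFFFF


def effective_address(mode: str, regs: Dict[str, int], operand: int = 0,
                      read16: Optional[Callable[[int], int]] = None) -> int:
    """Model of a stand-alone address ALU (16-bit addresses, banks ignored)."""
    if mode == "abs":
        return operand & MASK16
    if mode == "abs_x":
        return (operand + regs["X"]) & MASK16
    if mode == "abs_y":
        return (operand + regs["Y"]) & MASK16
    if mode == "sp_y":                   # Stack Y-Indexed: SP + Y
        return (regs["SP"] + regs["Y"]) & MASK16
    if mode == "sp_ind_y":               # pointer at SP + d, then + Y
        if read16 is None:
            raise ValueError("indirect mode needs a memory reader")
        # The pointer fetch is a memory cycle; only the + Y is the adder's job.
        pointer = read16((regs["SP"] + operand) & MASK16)
        return (pointer + regs["Y"]) & MASK16
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"X": 0x10, "Y": 0x0200, "SP": 0x01F0}
pointers = {0x01F2: 0x4000}
print(hex(effective_address("sp_y", regs)))                               # 0x3f0
print(hex(effective_address("sp_ind_y", regs, 2, pointers.__getitem__)))  # 0x4200
```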
Sheep64 wrote:
I see that Dr Jefyll has led you into the quagmire of block copy. You are clever to initiate transfer from stack references but this arrangement has nowhere to dump state when transfer is stopped prematurely. Given the vagaries of memory - especially beyond 64KB - a blitter peripheral may be preferable.
Agumander's GameTank blitter copies two dimensional areas and Radical Brad's blitter copies two dimensional areas with alpha channel and run length encoding. I recommend a one element FIFO and interrupt arrangement. This allows solid blitting with no loss of cycles. Power of two sprite scaling would be highly appreciated because, quite honestly, I'm tired of 6502 games below the standard of Out Run. Sprite scaling is too much to ask in discrete implementation but if you want block copy which matches or exceeds 65816 MVP/MVN, two dimensional copy covers more cases.
Eh, I just wanted to have some kind of MOV Instruction so it feels closer to the 65816. And while having something Blitter-ish would be sweet for bitmap graphics, it also sounds like it would be a pain to implement and would add a LOT of extra logic.
Plus, I never intended my MOV to be better than MVP/MVN; I just wanted it to be somewhat equivalent.
And with the way I currently want to implement it, there should be no Interrupt problems at all, as the state of an interrupted MOV is stored in user-accessible registers and on the Stack.
Sheep64 wrote:
May I be the first to suggest the additional quagmire of floating point? I believe that a very minimal FPU can be implemented with eight instructions or less. All instructions would be abs,X only and the implementation would not conform to IEEE754 or provide conversion functions. Instructions would load, store, add, subtract, compare and multiply. This would apply to a separate accumulator and common flag register. If you choose a suitable representation then DEC abs,X provides floating point right shift and atomic DEC abs,X provides right shift without underflow. Aha! Overall, this arrangement is more suited than x87 to the Raphson approximation of inverse square root.
If you work on an FPU, you should definitely read the FPU chapter of the John Wiley & Sons book Advanced FPGA Design by Steve Kilts. You should also read the 1987 internal INMOS documents regarding the T800 Transputer FPU. The INMOS documents have been published by David Mays. The T800 FPU document explains how to verify an FPU and was written a mere eight years before Intel Pentium Errata #2: FDIV Bug.
Floating point is a rabbit hole that I try to avoid because of how easily you can get sucked into it.
Doing fixed-point math in software should be fast enough for now (especially with the hardware MUL/DIV Instructions), but if I ever get into floating point I'll probably do it as a separate memory-mapped peripheral instead of building it directly into the CPU. (That would also mean a regular 65C02/816 could make use of the FPU, as it's just a peripheral in an FPGA/CPLD that connects to the 65xx bus.)
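As an example of what fixed point in software could look like on top of an 8x8 MUL, an unsigned 8.8 multiply can be assembled from four 8-bit partial products, much as one would schedule four MUL instructions. Here `mul8` simply stands in for the hardware instruction, the 8.8 format is an arbitrary choice, and overflow of the integer part is truncated rather than detected.

```python
def mul8(a: int, b: int) -> int:
    """Stand-in for the hardware 8x8 -> 16-bit MUL instruction."""
    return (a & 0xFF) * (b & 0xFF)


def fixmul_8_8(a: int, b: int) -> int:
    """Unsigned 8.8 x 8.8 -> 8.8 fixed-point multiply (truncating)."""
    ah, al = (a >> 8) & 0xFF, a & 0xFF
    bh, bl = (b >> 8) & 0xFF, b & 0xFF
    full = (mul8(ah, bh) << 16) + ((mul8(ah, bl) + mul8(al, bh)) << 8) + mul8(al, bl)
    return (full >> 8) & 0xFFFF   # keep 8 integer bits and 8 fraction bits


print(hex(fixmul_8_8(0x0180, 0x0200)))  # 1.5 * 2.0 = 3.0 -> 0x300
print(hex(fixmul_8_8(0x0080, 0x0080)))  # 0.5 * 0.5 = 0.25 -> 0x40
```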
Sheep64 wrote:
I asked a sheep in the particle physics field if CERN used hyperbolic tangent in relativistic calculations. The answer was "Not knowingly." Therefore, you can dump all of the stupid grandstanding floating point functions. Instead, do sine and cosine concurrently with successive approximations of rotation and implement the remainder with Maclaurin, Taylor and Raphson approximations.
I'm sorry, but I only understood like 3 words of that.
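For anyone else puzzled by that quote: "sine and cosine concurrently with successive approximations of rotation" most likely refers to CORDIC, which rotates a vector through a fixed table of shrinking angles, atan(2^-i), using only shifts, adds and a table lookup, and ends up with cos and sin in its two coordinates. The sketch below is a generic fixed-point CORDIC in rotation mode, not something from this project; the 16 fractional bits and 16 iterations are arbitrary, and the table and gain are computed with floating point only for convenience (real hardware would store them as constants).

```python
import math

FRAC_BITS = 16
ITERATIONS = 16
ONE = 1 << FRAC_BITS

# Table of atan(2^-i) in fixed point; hardware would keep this in ROM.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# Each rotation stretches the vector; starting x at 1/gain cancels that out.
GAIN = math.prod(math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(ITERATIONS))
X_START = round(ONE / GAIN)


def cordic_sin_cos(angle: int) -> tuple[int, int]:
    """Return (sin, cos) of a fixed-point angle in radians, |angle| <= pi/2."""
    x, y, z = X_START, 0, angle
    for i in range(ITERATIONS):
        if z >= 0:   # rotate counter-clockwise by atan(2^-i)
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:        # rotate clockwise by atan(2^-i)
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y, x


s, c = cordic_sin_cos(round(0.5 * ONE))
print(s / ONE, c / ONE)   # close to sin(0.5) and cos(0.5)
```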
Sheep64 wrote:
If you are experimenting with stack introspection, you may want to add a constant to RegS or implement PHS/PLS. Given the association of RegS and RegX, PHS/PLS is most logically implemented by escaping PHX/PLX. Unfortunately, pushing the current stack pointer onto the stack is useless. However, a computed value aids functions with a variable number of parameters.
So basically an instruction that pushes the current SP + some value onto the Stack?
That sounds similar to the LINK instruction on the 68k.
Though I don't see the point in having that, since if you want to access a variable amount of data or pointers on the Stack you can just use either the "Stack Y-Indexed" (SP,Y) or the "Stack Indirect Y-Indexed" ((d.SP,Y)) addressing mode.
With Y now being a 16-bit register, you can easily access the whole 64k of Stack (if the SP is in 16-bit mode).
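As a sketch of that alternative: with a 6502-style stack (SP points at the next free byte and pushes post-decrement), a caller can push its arguments followed by a 16-bit count, and the callee walks them with SP,Y without ever forming a frame pointer. The calling convention, the `Stack` class and `sum_varargs` are hypothetical, and the return address a real JSR would push is left out to keep the offsets simple.

```python
class Stack:
    """6502-style descending stack in a 64k bank (SP in 16-bit mode)."""

    def __init__(self, sp: int = 0xFFFF) -> None:
        self.mem = bytearray(0x10000)
        self.sp = sp

    def push(self, value: int) -> None:
        self.mem[self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFFFF

    def load_sp_y(self, y: int) -> int:
        """Model of a load using the SP,Y mode: read the byte at SP + Y."""
        return self.mem[(self.sp + y) & 0xFFFF]


def sum_varargs(stack: Stack) -> int:
    """Callee side: count is at SP+1 (low) and SP+2 (high), args follow."""
    count = stack.load_sp_y(1) | (stack.load_sp_y(2) << 8)
    return sum(stack.load_sp_y(y) for y in range(3, 3 + count))


st = Stack()
for _ in range(300):      # more than 255 arguments, so Y must exceed 8 bits
    st.push(1)
st.push(300 >> 8)         # count high byte
st.push(300 & 0xFF)       # count low byte
print(sum_varargs(st))    # 300
```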
Sheep64 wrote:
You may also want to experiment with separate program and data stacks. This may aid implementation of Forth. It is also of general interest. This could be activated with an escaped flag; possibly operating on an escaped RegQ flag register. For downward compatibility, it could instead be implemented as an opt-in, per use prefix. Unfortunately, an alt stack may be incompatible with A8/A15 segment signals. Separately, you may want to investigate a privilege system where escaped CLV drops privileges and has no counterpart to increase privileges. Likewise, PHQ/PLQ would be privileged.
I have thought about a System/User Stack and Supervisor/User Mode like on the 68k, but ultimately decided against it. Still, I made such features much easier to implement with external glue logic by having more V*A pins than the 65816 has.
Using RW (Read/Write), VVA (Valid Vector Address), and VSA (Valid Stack Address), the system can easily detect whether the CPU is accessing the Stack because of an Interrupt.
RW  VVA  VSA
X   0    X    -> Non-Interrupt Related Access
1   1    0    -> CPU Reading Interrupt Vector
0   1    1    -> CPU Writing Interrupt Return Address to Stack
1   1    1    -> CPU Reading Interrupt Return Address from Stack
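The RW/VVA/VSA table maps directly onto a small decoder; a Python model of that glue logic might look like this. The RW=0, VVA=1, VSA=0 combination isn't listed, so the sketch reports it as such rather than guessing what it means. The function name is hypothetical.

```python
def classify_access(rw: int, vva: int, vsa: int) -> str:
    """Decode RW/VVA/VSA per the table (RW: 1 = read, 0 = write)."""
    if not vva:
        return "non-interrupt access"
    if rw and not vsa:
        return "reading interrupt vector"
    if not rw and vsa:
        return "writing return address to stack"
    if rw and vsa:
        return "reading return address from stack"
    return "combination not listed"   # RW=0, VVA=1, VSA=0


for rw, vva, vsa in [(0, 0, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1)]:
    print(rw, vva, vsa, "->", classify_access(rw, vva, vsa))
```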
Using these signals you can easily build a simple memory mapper that replaces the upper 8 bits of the address bus (A16-A23) with some hardwired value, which basically means all Interrupt return addresses will be written to and read from their own 64k bank instead of wherever the actual Stack currently is.
The same logic can be used to enter/exit some kind of Supervisor mode. For example, when an Interrupt return address gets written, memory protection could be disabled, and when the address gets read back, memory protection would be restored. (You can also use a counter if you expect nested Interrupts.)
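Putting the mapper and the nesting counter together, the sketch below counts return-address bytes rather than Interrupts: every flagged write increments the counter and every flagged read decrements it, so Supervisor mode stays on until the last byte of the outermost return address has been read back. That only works if Interrupt entry and exit move the same number of bytes, which is an assumption; the bank value $FE, the vector address used in the demo, and the class name are hypothetical.

```python
IRQ_BANK = 0xFE  # hypothetical hardwired bank for Interrupt return addresses


class InterruptMapper:
    """Glue-logic model: redirect Interrupt stack traffic and track nesting."""

    def __init__(self, bank: int = IRQ_BANK) -> None:
        self.bank = bank
        self.outstanding = 0  # return-address bytes pushed but not yet pulled

    @property
    def supervisor(self) -> bool:
        """Memory protection would be disabled while this is True."""
        return self.outstanding > 0

    def bus_cycle(self, addr: int, rw: int, vva: int, vsa: int) -> int:
        """Return the 24-bit address actually driven onto the memory bus."""
        if vva and vsa:
            # Replace A16-A23 with the hardwired bank.
            addr = (self.bank << 16) | (addr & 0xFFFF)
            self.outstanding += -1 if rw else 1
        return addr


m = InterruptMapper()
stack_addrs = [0x0201FF, 0x0201FE, 0x0201FD]   # Stack currently in bank $02
for a in stack_addrs:                          # Interrupt entry: pushes
    print(hex(m.bus_cycle(a, rw=0, vva=1, vsa=1)), m.supervisor)
m.bus_cycle(0x00FFFE, rw=1, vva=1, vsa=0)      # vector fetch: not remapped
for a in reversed(stack_addrs):                # RTI: pulls
    m.bus_cycle(a, rw=1, vva=1, vsa=1)
print(m.supervisor)                            # False
```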
Sheep64 wrote:
Finally, I'm a keen proponent of abs,Z. You may wish to substitute all abs addressing with abs,Z. If RegZ is reset and never modified then STZ and abs,Z behave as expected. However, applications which populate RegZ have abs,X, abs,Y and abs,Z addressing modes and must zero one register to obtain abs. This arrangement is never worse than 65C02 and it is a significant boon as address-space and register width increases. Specifically, the semantic switches from abs+R to R+offset. Most immediately, for your extension, it allows (abs,X),Y or zp,X and sp,Y to be mixed with abs,Z.
Hmm, that idea is pretty interesting... but it would technically break compatibility with the 65CE02. Plus, what would it even affect? For example, you mentioned SP+Y even though it has no Absolute part to it.
If you really mean everything "Absolute" related, then it would also affect "Absolute Indirect" as in JMP (abs) and JMP (abs,X), as well as the "Long Absolute" and "Long Indirect" addressing modes.
I'll think about it, but I'd probably keep abs pure to make it easier for people to get into the CPU, as not having an un-indexed Absolute addressing mode would likely weird out some people.
I kind of wish I had a bit to spare in my Secondary Flags Register so I could make a toggle between "abs" and "abs,Z", just to see if I could get used to it.
I should really continue this project sometime in the future, but CC65 has got me in its claws right now.
Obviously I'll post whenever I get something done on this; until then, have a good day/week/month/year!