Personal 6502 "evolution" project

Sheep64 · Post by **Sheep64** » Fri May 21, 2021 11:44 am

I am alarmed when people work at a blistering pace. I think that I'm either making it hard for myself, the other person has missed a fundamental problem or they have vastly more talent than me. Given your apparent competence with instruction pipelines and micro-operations, I suspect that you have a particular aptitude for processor design which I lack.

You are correct to discard the 16-register design. Structured programs are 4-colorable graphs and, from the view of execution units, so is an unstructured program. A 4-colorable, 3-address system requires no more than six perfectly orthogonal registers, and deliberate asymmetry may benefit instruction density. Therefore, you are strongly advised to consider an eight-register design rather than the ad hoc addition of RegB and RegZ, which has historical accuracy in 6516 and 65CE02 and remains useful in contemporary bytecode. Admittedly, my proposed extension to 65CE02 has an alternative register set and this is arguably an architecture extension with RegB and RegZ. However, invocation occurs in a common prefix with operand size such that all 65CE02 opcodes facilitate an otherwise trivial and pure RISC extension.

I recently considered micro-ops for JIT flag handling within 8080 on 6502 simulation and 6502 on AVR simulation. The general consensus was that it was not a fun hobby project, too much work for too little gain or unworkable within 16 bit address-space. This may be true for software but you demonstrate that micro-ops are fun in FPGA.

I am concerned that you are upscaling 6502 instructions to 20 bit micro-ops. barrym95838's 65m02 is a strict, regular superset of 6502 and only requires 15 bit or so opcodes. Across three micro-ops, it should be possible to obtain 6-12 bit representation. Regular instructions are grouped into eight operations across eight addressing modes. Therefore, if you split such instructions into micro-ops, I would expect a reduction of cases. If you primarily read, modify or write with one micro-op, I presume the encodings are skewed towards ALU function or addressing mode. In such a case, it may be preferable to hold the micro-op in a multiplexed representation for the appropriate stage of the pipeline. Hopefully, this compacts FPGA layout and raises the maximum clock speed.
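The "eight operations across eight addressing modes" grouping matches the regular part of the standard 6502 opcode map, where opcodes whose low two bits are 01 follow an aaabbbcc layout. A minimal Python sketch of that decoding is below. The function name `decode_group1` is only illustrative, and the sketch says nothing about how either design encodes its micro-ops.

```python
# Regular "group one" 6502 opcodes use the bit layout aaabbbcc with cc == 01:
# aaa selects the operation and bbb selects the addressing mode.
OPERATIONS = ["ORA", "AND", "EOR", "ADC", "STA", "LDA", "CMP", "SBC"]
MODES = ["(zp,X)", "zp", "#imm", "abs", "(zp),Y", "zp,X", "abs,Y", "abs,X"]


def decode_group1(opcode):
    """Return (operation, addressing mode), or None if the opcode is not cc == 01."""
    if opcode & 0b11 != 0b01:
        return None
    operation = OPERATIONS[(opcode >> 5) & 0b111]
    mode = MODES[(opcode >> 2) & 0b111]
    return operation, mode


print(decode_group1(0xA9))  # LDA immediate
print(decode_group1(0x8D))  # STA absolute
print(decode_group1(0xEA))  # NOP is outside this group
```

The grid is not perfectly full: 0x89 would decode as STA #imm, which is not a documented NMOS 6502 instruction, so a decoder built this way still needs to reject or remap that slot.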

You are wise to avoid 64 bit extension until the instruction pipeline is working. FPGA addition is O(n log n) and therefore basic functionality, such as ALU, may bloat more than expected - with consequent performance loss. (Even when treating a microcontroller as a convenient microcode implementation, loss of performance is linear. It also bloats EEPROM to the extent that features are compromised or mutually exclusive.)

I presume that your detailed consideration of micro-ops led to solving atomic operations. I tried to solve this problem for more than one year without success - publicly since Aug 2020. Your proposal for read-modify-write with conditional write is perfectly workable and I believe that it averts the ABA problem. It would be particularly compatible with 6502 variants with ML [Memory Lock] signal line - and may encourage implementers to use such a signal. Furthermore, it would almost entirely eliminate Turing complete use of the interrupt mask, and its use would therefore improve interrupt performance.

Don't apologize for not using the full Tomasulo algorithm nor implementing speculative execution. Prior to Dec 2019, no-one at Intel, AMD or ARM was able to implement the speculative extension correctly. Indeed, if you read the Transputer or VIPER documents, it should be apparent that formal methods are typically about eight years behind products. That's only four iterations of Moore's law - and beyond the life of most products. Actually, a simplistic, deterministic implementation with the occasional wasted cycle is beneficial for atomic operations and other cases. It allows, for example, interrupt routines to complete in a fixed number of cycles. This is more important than aggregate performance; especially in low power or massively parallel environments.

If you are experimenting with micro-ops and disallowing unaligned access, you may wish to consider a second set of wide instructions. In this case, one aligned operation only requires use of the first set of wide instructions. However, if the first set of wide instructions is used in an unaligned manner, operands are barrel shifted but no attempt is made to initiate a second memory cycle. Unaligned operations are completed with the second set of wide instructions. Within a little-endian architecture, it is possible to make this work with logical operations, arithmetic operations and flag handling. In addition to simplifying an instruction pipeline and memory interface, it also means never handling access across cache lines, MMU pages or bank switching windows.
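One possible reading of the two-instruction scheme, for a load only, is sketched below in Python. It assumes a 16-bit data bus that can only read aligned words from a little-endian, byte-addressed memory. The names `aligned_read16`, `load_first` and `load_second` are hypothetical and do not come from either design.

```python
def aligned_read16(mem, addr):
    """One memory cycle: read the aligned 16-bit word that contains addr."""
    base = addr & ~1
    return mem[base] | (mem[base + 1] << 8)


def load_first(mem, addr):
    """First-set wide load: a single aligned cycle plus a barrel shift.

    For an odd address only the low byte of the result is valid, because
    no second memory cycle is started.
    """
    word = aligned_read16(mem, addr)
    return word >> 8 if addr & 1 else word


def load_second(mem, addr, partial):
    """Second-set wide load: completes an unaligned access.

    Reads the next aligned word and merges its low byte into the high
    byte of the partial result. For an aligned address it changes nothing.
    """
    if addr & 1:
        next_word = aligned_read16(mem, addr + 1)
        return (partial & 0x00FF) | ((next_word & 0x00FF) << 8)
    return partial


mem = bytearray([0x11, 0x22, 0x33, 0x44])
partial = load_first(mem, 1)           # low byte only
full = load_second(mem, 1, partial)    # both bytes of the unaligned word
print(hex(partial), hex(full))
```

In this reading each instruction performs exactly one aligned memory cycle, so neither of them ever straddles a cache line, MMU page or bank window. Arithmetic would follow the same pattern, with the carry from the low part feeding the second instruction.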

Anyhow, I am very impressed with the speed at which you work and I am particularly impressed that you solved a problem which eluded me. I hope my suggestions improve the throughput and generality of your design.

aleferri · Post by **aleferri** » Sun May 23, 2021 12:46 pm

Hi Sheep64!
Well, point by point:

Quote:

I am alarmed when people work at a blistering pace. I think that I'm either making it hard for myself, the other person has missed a fundamental problem or they have vastly more talent than me. Given your apparent competence with instruction pipelines and micro-operations, I suspect that you have a particular aptitude for processor design which I lack.

I am a software engineer, but I became interested in CPU development years ago. I don't think I am missing a fundamental problem; at most I will have to spend a lot of time to include "JSR memory". I haven't completed testing and interrupts are completely missing. In fact I still need to assemble the whole core (fetch_unit.v, decode_unit.v, ucore.v, lsu.v).

Quote:

You are correct to discard the 16-register design. Structured programs are 4-colorable graphs and, from the view of execution units, so is an unstructured program. A 4-colorable, 3-address system requires no more than six perfectly orthogonal registers, and deliberate asymmetry may benefit instruction density. Therefore, you are strongly advised to consider an eight-register design rather than the ad hoc addition of RegB and RegZ, which has historical accuracy in 6516 and 65CE02 and remains useful in contemporary bytecode. Admittedly, my proposed extension to 65CE02 has an alternative register set and this is arguably an architecture extension with RegB and RegZ. However, invocation occurs in a common prefix with operand size such that all 65CE02 opcodes facilitate an otherwise trivial and pure RISC extension.

Historical accuracy helps to balance my decision to re-encode the whole instruction set. If I change too much, people will come and say "this is not related to 6502". Registers B and Z, plus the inclusion of SF and PC in the addressing space, make my design 8 registers.

Quote:

I recently considered micro-ops for JIT flag handling within 8080 on 6502 simulation and 6502 on AVR simulation. The general consensus was that it was not a fun hobby project, too much work for too little gain or unworkable within 16 bit address-space. This may be true for software but you demonstrate that micro-ops are fun in FPGA.

Flags are hard when they are not implemented directly in hardware, and implementing them individually is even worse, in hardware too. That is the reason for the s flag in the opcode (s = 0 means don't save flags) and for the decision to have instructions with s = 1 write the whole flag register instead of individual flags.
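The effect of the s bit can be illustrated with a small Python model. This is only a sketch under assumptions: it uses a 16-bit ADD and a 6502-style N, Z, C, V flag set, and the names `add16` and `execute_add` are made up for the example rather than taken from the design.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """16-bit add that returns the result and a complete set of flags."""
    total = a + b + carry_in
    result = total & MASK16
    flags = {
        "N": (result >> 15) & 1,                      # sign bit of the result
        "Z": int(result == 0),                        # result is zero
        "C": int(total > MASK16),                     # carry out of bit 15
        "V": int(((a ^ result) & (b ^ result) & 0x8000) != 0),  # signed overflow
    }
    return result, flags


def execute_add(regs, flags, dest, a, b, s):
    """Write the result to regs[dest] and return the flag register to keep.

    s = 1 replaces the whole flag register in one go;
    s = 0 leaves the previous flags untouched.
    """
    result, new_flags = add16(a, b)
    regs[dest] = result
    return new_flags if s else flags


regs = {}
old_flags = {"N": 0, "Z": 0, "C": 0, "V": 0}
print(execute_add(regs, old_flags, "A", 0xFFFF, 1, s=0))  # flags unchanged
print(execute_add(regs, old_flags, "A", 0xFFFF, 1, s=1))  # whole register rewritten
```

Writing the register as a unit means there is a single write enable for all flags, rather than one per flag and per instruction class.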

Quote:

I am concerned that you are upscaling 6502 instructions to 20 bit micro-ops. barrym95838's 65m02 is a strict, regular superset of 6502 and only requires 15 bit or so opcodes. Across three micro-ops, it should be possible to obtain 6-12 bit representation. Regular instructions are grouped into eight operations across eight addressing modes. Therefore, if you split such instructions into micro-ops, I would expect a reduction of cases. If you primarily read, modify or write with one micro-op, I presume the encodings are skewed towards ALU function or addressing mode. In such a case, it may be preferable to hold the micro-op in a multiplexed representation for the appropriate stage of the pipeline. Hopefully, this compacts FPGA layout and raises the maximum clock speed.

I cannot shrink them:
3 bits for source A
3 bits for source B
1 bit reserved
1 bit bypass using k16
4 bits dest ( 8 registers + WriteMAR_Width8, WriteMAR_Width16, Discard x 6),
4 bits alu function (ADD/ADC, SUB/SBC, INC, DEP, LSR/ROR, ASL/ROL, AND, ORA, EOR, LDA, EXT, BSW, NOP, NOP, NOP, NOP)
1 bit save result
1 bit carry mask
1 bit save flags
1 bit load
1 bit write
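To make the field list concrete, here is a minimal Python sketch that packs and unpacks a micro-op word using the widths listed above. The field names, and the choice to place the first listed field in the least significant bits, are assumptions made for illustration; the actual bit order in the design is not stated. Taken as written, the listed widths add up to 21 bits including the reserved bit.

```python
# (name, width) pairs in the order given in the post; names are illustrative.
FIELDS = [
    ("src_a", 3),
    ("src_b", 3),
    ("reserved", 1),
    ("bypass_k16", 1),
    ("dest", 4),
    ("alu_fn", 4),
    ("save_result", 1),
    ("carry_mask", 1),
    ("save_flags", 1),
    ("load", 1),
    ("write", 1),
]
TOTAL_WIDTH = sum(width for _, width in FIELDS)


def pack(values):
    """Pack a dict of field values into one integer; missing fields are 0."""
    word = 0
    shift = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word):
    """Split a packed micro-op word back into a dict of fields."""
    values = {}
    shift = 0
    for name, width in FIELDS:
        values[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return values


uop = pack({"src_a": 2, "src_b": 5, "dest": 9, "alu_fn": 3, "save_flags": 1})
print(TOTAL_WIDTH, hex(uop), unpack(uop)["dest"])
```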

I will respond to the remaining considerations tomorrow.

EDIT: grammar

aleferri · Post by **aleferri** » Mon May 24, 2021 9:12 am

Quote:

You are wise to avoid 64 bit extension until the instruction pipeline is working. FPGA addition is O(n log n) and therefore basic functionality, such as ALU, may bloat more than expected - with consequent performance loss. (Even when treating a microcontroller as a convenient microcode implementation, loss of performance is linear. It also bloats EEPROM to the extent that features are compromised or mutually exclusive.)

Actually, I plan to switch from 16 to 32 bit, not 64 bit. I know I will need a bigger FPGA for the 32 bit transition, as my very old Altera Cyclone II isn't going to fit the design. However, EP4CE15 dev boards are cheap at 30€ with Ethernet, 32 mb SDRAM, COM and 15k logic elements, which is more than enough for the 32 bit extension.

Quote:

I presume that your detailed consideration of micro-ops led to solving atomic operations. I tried to solve this problem for more than one year without success - publicly since Aug 2020. Your proposal for read-modify-write with conditional write is perfectly workable and I believe that it averts the ABA problem. It would be particularly compatible with 6502 variants with ML [Memory Lock] signal line - and may encourage implementers to use such a signal. Furthermore, it would almost entirely eliminate Turing complete use of the interrupt mask, and its use would therefore improve interrupt performance.

My design explicitly serializes on stores, and RMW operations are stores. An additional ML pin is a good idea; thanks for your suggestion.

Quote:

Don't apologize for not using the full Tomasulo algorithm nor implementing speculative execution. Prior to Dec 2019, no-one at Intel, AMD or ARM was able to implement the speculative extension correctly. Indeed, if you read the Transputer or VIPER documents, it should be apparent that formal methods are typically about eight years behind products. That's only four iterations of Moore's law - and beyond the life of most products. Actually, a simplistic, deterministic implementation with the occasional wasted cycle is beneficial for atomic operations and other cases. It allows, for example, interrupt routines to complete in a fixed number of cycles. This is more important than aggregate performance; especially in low power or massively parallel environments.

Tomasulo is related to speculative execution but does not necessarily imply it. Even in 2021 speculative execution is not fully understood, as new side-channel attacks keep coming out. I don't have the horsepower to debug and verify a speculative execution trace. I know how to implement it; in fact I could extend the design to implement it in a week, but then I would need a year to debug it. I am plain bad at keeping projects going for years.

Quote:

If you are experimenting with micro-ops and disallowing unaligned access, you may wish to consider a second set of wide instructions. In this case, one aligned operation only requires use of the first set of wide instructions. However, if the first set of wide instructions is used in an unaligned manner, operands are barrel shifted but no attempt is made to initiate a second memory cycle. Unaligned operations are completed with the second set of wide instructions. Within a little-endian architecture, it is possible to make this work with logical operations, arithmetic operations and flag handling. In addition to simplifying an instruction pipeline and memory interface, it also means never handling access across cache lines, MMU pages or bank switching windows.

Unfortunately I don't have enough instructions left to implement a second set, unless I give up on ALU operations on unaligned memory and only implement LDA/STA. In that case I can use your method.

Quote:

Anyhow, I am very impressed with the speed at which you work and I am particularly impressed that you solved a problem which eluded me. I hope my suggestions improve the throughput and generality of your design.

Like I said before, I am not ready yet. The pace may seem fast, but debugging is slow. I will have to hunt bugs for a month before the design can be considered BETA. Thank you for your input anyway; a detailed and in-depth review of a design is not something you get every day.

aleferri · Post by **aleferri** » Tue May 25, 2021 1:55 pm

Update.
It reached alpha stage. I have to fix the decoder, but the internal core seems to work. Fetch seems to work too.
Simulated with iverilog. Synthesis seems to be fine with yosys.
I discarded short instructions, they are infrequent and not worth the hassle. Poor decisions were fixed (e.g. reset to zero, but zero is a stable state).

Missing instructions: bsr, jsr.
Missing features: interrupts.

If you'd like to dive into a sea of pain, you can simulate instruction sequences and report bugs.

aleferri · Post by **aleferri** » Fri May 28, 2021 1:56 pm

Good news.
I fixed another ton of bugs. Now this program

Code:

LDZ      #0 ?
LDS      #0
LDY      #$64
LDA      #$C000
STA      $00A0
LDA      #$B000
STA      $00A2
SUB:Y    #1 ?
LDA      byte ($00A0), Y
STA      byte ($00A2), Y
BNE      -16

runs correctly in 812 cycles.
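For readers who want to check what the listing is expected to do, here is a behavioural Python model of it. It rests on several assumptions that the thread does not confirm: that the trailing `?` marks an instruction that updates flags, that `SUB:Y #1` decrements Y, that `byte (zp),Y` accesses one byte at a 16-bit little-endian pointer plus Y, and that `BNE -16` branches back to the `SUB:Y` line. The model does not count cycles.

```python
def run_copy(mem):
    """Model the listing's effect on a 64 KiB bytearray; returns loop iterations."""
    # LDA #$C000 / STA $00A0 and LDA #$B000 / STA $00A2:
    # two 16-bit pointers stored little-endian in low memory.
    mem[0x00A0], mem[0x00A1] = 0x00, 0xC0
    mem[0x00A2], mem[0x00A3] = 0x00, 0xB0

    y = 0x64  # LDY #$64
    iterations = 0
    while True:
        y = (y - 1) & 0xFFFF              # SUB:Y #1 ? (assumed to set the zero flag)
        zero = y == 0
        src = mem[0x00A0] | (mem[0x00A1] << 8)
        dst = mem[0x00A2] | (mem[0x00A3] << 8)
        value = mem[(src + y) & 0xFFFF]   # LDA byte ($00A0), Y
        mem[(dst + y) & 0xFFFF] = value   # STA byte ($00A2), Y
        iterations += 1
        if zero:                          # BNE -16 falls through once Y reaches 0
            break
    return iterations


memory = bytearray(0x10000)
for i in range(0x64):
    memory[0xC000 + i] = (i * 7 + 1) & 0xFF
print(run_copy(memory), memory[0xB000:0xB064] == memory[0xC000:0xC064])
```

Under these assumptions the loop copies 100 bytes from $C000 to $B000, starting at offset 99 and ending at offset 0, so 812 cycles works out to roughly eight cycles per iteration with the setup included.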

aleferri · Post by **aleferri** » Sat May 29, 2021 4:12 pm

I had this weird idea about implementing interrupts and jsr/bsr with the same circuitry.
After all they look the same, push pc first and then jump to address (address = pc + offset if bsr).
I hope it doesn't turn out too difficult to manage.
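The shared shape of the three operations can be sketched in Python as one "push PC, then jump" helper with three ways of producing the target. This is a model under assumptions, not the design's circuitry: it assumes a 16-bit PC, a descending stack in byte-addressed memory, and an interrupt vector read from memory. All function names are hypothetical, and flag saving on interrupt entry is deliberately left out.

```python
def push16(mem, sp, value):
    """Push a 16-bit value on a descending stack, low byte at the lower address."""
    sp = (sp - 2) & 0xFFFF
    mem[sp] = value & 0xFF
    mem[(sp + 1) & 0xFFFF] = (value >> 8) & 0xFF
    return sp


def enter(mem, pc, sp, target):
    """Common path: push the return PC, then continue at target."""
    sp = push16(mem, sp, pc)
    return target & 0xFFFF, sp


def jsr(mem, pc, sp, address):
    """Absolute call: the target is the operand itself."""
    return enter(mem, pc, sp, address)


def bsr(mem, pc, sp, offset):
    """Relative call: the target is pc + offset."""
    return enter(mem, pc, sp, pc + offset)


def interrupt(mem, pc, sp, vector):
    """Interrupt entry: the target is fetched from a vector in memory."""
    target = mem[vector] | (mem[(vector + 1) & 0xFFFF] << 8)
    return enter(mem, pc, sp, target)


mem = bytearray(0x10000)
mem[0xFFFE], mem[0xFFFF] = 0x00, 0x80
print(jsr(mem, 0x1234, 0x0200, 0x4000))
print(bsr(mem, 0x1000, 0x0200, -16))
print(interrupt(mem, 0x2000, 0x0200, 0xFFFE))
```

Only the source of the target differs between the three cases, which is what makes sharing the push-and-jump path plausible. A real interrupt entry would also have to save the flags and mask further interrupts.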

aleferri · Post by **aleferri** » Wed Jun 23, 2021 10:56 am

Ok, guys. Flags are hard. They are the worst part of the design. I made them addressable and i regret doing it.
I regret it so much that now i will remove addressable flags and put 2 special instructions to load and store flags. This will simplify the design a lot.

BigEd · Post by **BigEd** » Wed Jun 23, 2021 12:10 pm

It's good to be able to experiment, and to backtrack!

barrym95838 · Post by **barrym95838** » Thu Jun 24, 2021 8:59 pm

aleferri wrote:

Ok, guys. Flags are hard. They are the worst part of the design. I made them addressable and i regret doing it.
I regret it so much that now i will remove addressable flags and put 2 special instructions to load and store flags. This will simplify the design a lot.

I have pondered overhauling my incomplete 65m32a design once again to replace nearly all of the condition codes with special pdp-8 inspired instructions like test-and-skip-if, decrement-and-skip-if, compare-and-skip-if, et al. It's different enough that I may try some coding exercises to see which one feels preferable, but my general feeling is that the skip style will have a potentially higher performance ceiling and the flag style will be more tidy and pleasurable in a 65xx kind of way. Fertile spare time is my elusive frenemy, as it has been for a large number of years.
