Re: Personal 6502 "evolution" project
Posted: Fri May 21, 2021 11:44 am
I am alarmed when people work at a blistering pace. I think that either I'm making it hard for myself, the other person has missed a fundamental problem or they have vastly more talent than me. Given your apparent competence with instruction pipelines and micro-operations, I suspect that you have a particular aptitude for processor design which I lack.
You are correct to discard the 16 register design. Structured programs are 4-colorable graphs and, from the view of execution units, so is an unstructured program. A 4-colorable, 3-address system requires no more than six perfectly orthogonal registers and deliberate asymmetry may benefit instruction density. Therefore, you are strongly advised to consider an eight register design rather than the ad hoc addition of RegB and RegZ, which has historical accuracy in 6516 and 65CE02 and remains useful in contemporary bytecode. Admittedly, my proposed extension to 65CE02 has an alternative register set and this is arguably an architecture extension with RegB and RegZ. However, invocation occurs in a common prefix with operand size such that all 65CE02 opcodes facilitate an otherwise trivial and pure RISC extension.
I recently considered micro-ops for JIT flag handling within 8080 on 6502 simulation and 6502 on AVR simulation. The general consensus was that it was not a fun hobby project, too much work for too little gain or unworkable within 16 bit address-space. This may be true for software but you demonstrate that micro-ops are fun in FPGA.
I am concerned that you are upscaling 6502 instructions to 20 bit micro-ops. barrym95838's 65m02 is a strict, regular superset of 6502 and only requires 15 bit or so opcodes. Across three micro-ops, it should be possible to obtain a 6-12 bit representation. Regular instructions are grouped into eight operations across eight addressing modes. Therefore, if you split such instructions into micro-ops, I would expect a reduction of cases. If you primarily read, modify or write with one micro-op, I presume the encodings are skewed towards ALU function or addressing mode. In such a case, it may be preferable to hold the micro-op in a multiplexed representation for the appropriate stage of the pipeline. Hopefully, this compacts FPGA layout and raises the maximum clock speed.
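To illustrate the "eight operations across eight addressing modes" grouping, here is a minimal Python sketch. It relies on the documented aaabbbcc layout of 6502 opcodes, where the cc=01 group holds the regular accumulator instructions. The function names and the idea of packing the two fields into a 6 bit value are my own assumptions for illustration, not anything from your design or from 65m02.

```python
# Regular 6502 opcodes follow the bit pattern aaabbbcc.
# For cc == 01, aaa selects the operation and bbb the addressing mode.
OPERATIONS = ("ORA", "AND", "EOR", "ADC", "STA", "LDA", "CMP", "SBC")
MODES = ("(zp,X)", "zp", "#imm", "abs", "(zp),Y", "zp,X", "abs,Y", "abs,X")


def split_regular_opcode(opcode):
    """Return (operation index, mode index) for a cc=01 opcode, else None.

    Note: the pattern also matches 0x89 (nominally STA #imm), which is not
    a real NMOS 6502 instruction; a full decoder must special-case it.
    """
    if opcode & 0b11 != 0b01:
        return None
    return (opcode >> 5) & 0b111, (opcode >> 2) & 0b111


def pack_fields(operation, mode):
    """Hypothetical 6 bit encoding: 3 bits of operation, 3 bits of mode."""
    return (operation << 3) | mode


def describe(opcode):
    """Render a regular opcode as 'OP mode', or '?' if it is not regular."""
    fields = split_regular_opcode(opcode)
    if fields is None:
        return "?"
    operation, mode = fields
    return f"{OPERATIONS[operation]} {MODES[mode]}"
```

For example, describe(0xA9) gives "LDA #imm" and describe(0x6D) gives "ADC abs". The point is only that the ALU function and the addressing mode are already separate 3 bit fields, so each could travel to a different pipeline stage.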
You are wise to avoid 64 bit extension until the instruction pipeline is working. FPGA addition is O(n log n) and therefore basic functionality, such as ALU, may bloat more than expected - with consequent performance loss. (Even when treating a microcontroller as a convenient microcode implementation, loss of performance is linear. It also bloats EEPROM to the extent that features are compromised or mutually exclusive.)
I presume that your detailed consideration of micro-ops led to solving atomic operations. I tried to solve this problem for more than one year without success - publicly since Aug 2020. Your proposal for read-modify-write with conditional write is perfectly workable and I believe that it averts the ABA problem. It would be particularly compatible with 6502 variants with an ML [Memory Lock] signal line - and may encourage implementers to use such a signal. Furthermore, it would almost entirely eliminate Turing complete use of the interrupt mask and would therefore improve interrupt performance.
Don't apologize for not using the full Tomasulo algorithm nor implementing speculative execution. Prior to Dec 2019, no-one at Intel, AMD or ARM was able to implement the speculative extension correctly. Indeed, if you read the Transputer or VIPER documents, it should be apparent that formal methods are typically about eight years behind products. That's only four iterations of Moore's law - and beyond the life of most products. Actually, a simplistic, deterministic implementation with the occasional wasted cycle is beneficial for atomic operations and other cases. It allows, for example, interrupt routines to complete in a fixed number of cycles. This is more important than aggregate performance, especially in low power or massively parallel environments.
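For anyone following along, here is a rough Python model of what I understand by read-modify-write with conditional write. It is my own simplification and not your actual micro-op sequence: the class name, the method name and the ml_asserted flag are all hypothetical, and the bus is assumed to be held for the whole sequence so that no other master can intervene.

```python
class LockedBus:
    """Toy memory with a flag standing in for the ML (Memory Lock) line."""

    def __init__(self, size=65536):
        self.mem = bytearray(size)
        self.ml_asserted = False  # assumption: models the ML signal

    def rmw_conditional(self, addr, expected, new_value):
        """Read addr; write new_value only if the byte still equals expected.

        Returns True if the write happened. ML is held for the whole
        read-compare-write sequence and released afterwards.
        """
        self.ml_asserted = True
        try:
            current = self.mem[addr]
            if current != expected:
                return False  # conditional write suppressed
            self.mem[addr] = new_value & 0xFF
            return True
        finally:
            self.ml_asserted = False


def atomic_increment(bus, addr):
    """Retry loop built on the conditional write; returns the old value."""
    while True:
        old = bus.mem[addr]
        if bus.rmw_conditional(addr, old, old + 1):
            return old
```

Note that atomic_increment never touches the interrupt mask, which is the benefit I describe above. This sketch only compares byte values, so it should not be read as a demonstration of the ABA property; that depends on details of your proposal which are not modelled here.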
If you are experimenting with micro-ops and disallowing unaligned access, you may wish to consider a second set of wide instructions. In this case, one aligned operation only requires use of the first set of wide instructions. However, if the first set of wide instructions is used in an unaligned manner, operands are barrel shifted but no attempt is made to initiate a second memory cycle. Unaligned operations are completed with the second set of wide instructions. Within a little-endian architecture, it is possible to make this work with logical operations, arithmetic operations and flag handling. In addition to simplifying an instruction pipeline and memory interface, it also means never handling access across cache lines, MMU pages or bank switching windows.
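The two-instruction scheme can be sketched in Python as follows. This assumes a 4 byte aligned memory cycle and little-endian loads; the function names are hypothetical and stand in for the first and second sets of wide instructions. Each function performs exactly one aligned read, so neither ever crosses a word boundary.

```python
WORD = 4  # bytes per aligned memory cycle (assumed width)


def read_aligned(mem, addr):
    """One memory cycle: fetch the aligned little-endian word containing addr."""
    base = addr & ~(WORD - 1)
    return int.from_bytes(mem[base:base + WORD], "little")


def load_first(mem, addr):
    """First wide load: aligned read, barrel shifted down by the byte offset.

    For an aligned address this already yields the complete word.
    """
    shift = 8 * (addr % WORD)
    return read_aligned(mem, addr) >> shift


def load_second(mem, addr, partial):
    """Second wide load: fetch the next aligned word and merge the high bytes."""
    offset = addr % WORD
    if offset == 0:
        return partial  # nothing left to fetch
    high = read_aligned(mem, addr + WORD)
    mask = (1 << (8 * WORD)) - 1
    return (partial | (high << (8 * (WORD - offset)))) & mask
```

With mem = bytes(range(16)), load_first(mem, 5) returns 0x070605 and load_second(mem, 5, 0x070605) completes it to 0x08070605. An aligned load at address 4 needs only load_first. The same split extends to stores, and with care to arithmetic and flags, but that is beyond this sketch.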
Anyhow, I am very impressed with the speed at which you work and I am particularly impressed that you solved a problem which eluded me. I hope my suggestions improve the throughput and generality of your design.