I am alarmed when people work at a blistering pace. I think that either I'm making it hard for myself, the other person has missed a fundamental problem or they have vastly more talent than me. Given your apparent competence with instruction pipelines and micro-operations, I suspect that you have a particular aptitude for processor design which I lack.
You are correct to discard the 16 register design. Structured programs are 4-colorable graphs and, from the view of the execution units, so is an unstructured program. A 4-colorable, 3-address system requires no more than six perfectly orthogonal registers and deliberate asymmetry may benefit instruction density. Therefore, you are strongly advised to consider an eight register design rather than the ad hoc addition of RegB and RegZ which has historical accuracy in 6516 and 65CE02 and remains useful in contemporary bytecode. Admittedly, my proposed extension to 65CE02 has an alternative register set and this is arguably an architecture extension with RegB and RegZ. However, invocation occurs in a common prefix with operand size such that all 65CE02 opcodes facilitate an otherwise trivial and pure RISC extension.
I recently considered micro-ops for JIT flag handling within 8080 on 6502 simulation and 6502 on AVR simulation. The general consensus was that it was not a fun hobby project, too much work for too little gain or unworkable within 16 bit address-space. This may be true for software but you demonstrate that micro-ops are fun in FPGA.
I am concerned that you are upscaling 6502 instructions to 20 bit micro-ops. barrym95838's 65m02 is a strict, regular superset of 6502 and only requires 15 bit or so opcodes. Across three micro-ops, it should be possible to obtain a 6-12 bit representation. Regular instructions are grouped into eight operations across eight addressing modes. Therefore, if you split such instructions into micro-ops, I would expect a reduction of cases. If you primarily read, modify or write with one micro-op, I presume the encodings are skewed towards ALU function or addressing mode. In such a case, it may be preferable to hold the micro-op in a multiplexed representation for the appropriate stage of the pipeline. Hopefully, this compacts FPGA layout and raises the maximum clock speed.
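As a rough illustration of the "eight operations across eight addressing modes" grouping: on the original 6502, opcodes whose low two bits are 01 follow an aaabbb01 pattern. The sketch below is in Python and the helper name decode_group1 is made up for this example. It only shows the regularity of that group and says nothing about how the micro-ops in this design are actually encoded.

```python
# Regular "group 1" 6502 opcodes have the bit layout aaabbb01:
# aaa selects the operation, bbb selects the addressing mode.
OPERATIONS = ["ORA", "AND", "EOR", "ADC", "STA", "LDA", "CMP", "SBC"]
MODES = ["(zp,X)", "zp", "#imm", "abs", "(zp),Y", "zp,X", "abs,Y", "abs,X"]


def decode_group1(opcode):
    """Split a group 1 opcode into (operation, addressing mode).

    Hypothetical helper for illustration. Note that one combination,
    0x89 (STA #imm), is not a valid instruction on the NMOS 6502.
    """
    if opcode & 0b11 != 0b01:
        raise ValueError("not a group 1 opcode")
    operation = OPERATIONS[(opcode >> 5) & 0b111]
    mode = MODES[(opcode >> 2) & 0b111]
    return operation, mode


if __name__ == "__main__":
    for op in (0xA9, 0x8D, 0x71):
        print(hex(op), decode_group1(op))
```

Six bits are enough to name every case in this group, which is the kind of compaction the paragraph above is hoping for.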
You are wise to avoid 64 bit extension until the instruction pipeline is working. FPGA addition is O(n log n) and therefore basic functionality, such as ALU, may bloat more than expected - with consequent performance loss. (Even when treating a microcontroller as a convenient microcode implementation, loss of performance is linear. It also bloats EEPROM to the extent that features are compromised or mutually exclusive.)
I presume that your detailed consideration of micro-ops led to solving atomic operations. I tried to solve this problem for more than one year without success - publicly since Aug 2020. Your proposal for read-modify-write with conditional write is perfectly workable and I believe that it averts the ABA problem. It would be particularly compatible with 6502 variants with the ML [Memory Lock] signal line - and may encourage implementers to use that signal. Furthermore, it would almost entirely eliminate Turing complete use of the interrupt mask and its use would therefore improve interrupt performance.
Don't apologize for not using the full Tomasulo algorithm nor implementing speculative execution. Prior to Dec 2019, no-one at Intel, AMD or ARM was able to implement the speculative extension correctly. Indeed, if you read the Transputer or VIPER documents, it should be apparent that formal methods are typically about eight years behind products. That's only four iterations of Moore's law - and beyond the life of most products. Actually, a simplistic, deterministic implementation with the occasional wasted cycle is beneficial for atomic operations and other cases. It allows, for example, interrupt routines to complete in a fixed number of cycles. This is more important than aggregate performance; especially in low power or massively parallel environments.
If you are experimenting with micro-ops and disallowing unaligned access, you may wish to consider a second set of wide instructions. In this case, one aligned operation only requires use of the first set of wide instructions. However, if the first set of wide instructions is used in an unaligned manner, operands are barrel shifted but no attempt is made to initiate a second memory cycle. Unaligned operations are completed with the second set of wide instructions. Within a little-endian architecture, it is possible to make this work with logical operations, arithmetic operations and flag handling. In addition to simplifying an instruction pipeline and memory interface, it also means never handling access across cache lines, MMU pages or bank switching windows.
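One possible reading of the two instruction sets, sketched in Python. Everything here is an assumption for illustration: the 4 byte word size, the names load_first and load_second, and the choice of a load rather than an arithmetic operation. The idea resembles the LWL/LWR pair on MIPS. Each instruction performs exactly one aligned memory read, so no single instruction ever touches two words.

```python
WORD_BYTES = 4  # assumed bus width for this sketch
WORD_MASK = (1 << (8 * WORD_BYTES)) - 1


def read_aligned_word(mem, byte_addr):
    """Single memory cycle: fetch the aligned little-endian word holding byte_addr."""
    base = byte_addr - (byte_addr % WORD_BYTES)
    return int.from_bytes(mem[base:base + WORD_BYTES], "little")


def load_first(mem, addr):
    """First set: one aligned read, barrel shifted down by the byte offset.

    For an aligned address this already yields the complete word.
    """
    offset = addr % WORD_BYTES
    return read_aligned_word(mem, addr) >> (8 * offset)


def load_second(mem, addr, partial):
    """Second set: read the following aligned word and merge the missing high bytes."""
    offset = addr % WORD_BYTES
    if offset == 0:
        return partial  # nothing left to complete
    next_word = read_aligned_word(mem, addr + WORD_BYTES)
    return (partial | (next_word << (8 * (WORD_BYTES - offset)))) & WORD_MASK


if __name__ == "__main__":
    mem = bytes(range(0x10, 0x18))
    partial = load_first(mem, 1)
    print(hex(load_second(mem, 1, partial)))
```

Aligned code only ever issues load_first; unaligned code issues both, and the programmer or compiler decides which case applies.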
Anyhow, I am very impressed with the speed at which you work and I am particularly impressed that you solved a problem which eluded me. I hope my suggestions improve the throughput and generality of your design.
Personal 6502 "evolution" project
- Sheep64
- In Memoriam
- Posts: 311
- Joined: 11 Aug 2020
- Location: A magnetic field
Re: Personal 6502 "evolution" project
Hi Sheep64!
Well, point per point:
I am a software engineer, but I have been interested in CPU development for years. I don't think I am missing a fundamental problem; at most I would have to waste a lot of time to include "JSR memory". I haven't completed testing and interrupts are completely missing. In fact I still need to assemble the whole core (fetch_unit.v, decode_unit.v, ucore.v, lsu.v).
Historical accuracy helps to balance my decision to re-encode the whole instruction set. If I change too much, people will come and say "this is not related to 6502". Registers B and Z and the inclusion of SF and PC in the addressing space make my design an 8 register design.
Flags are hard when not implemented directly in hardware, and implementing them individually is even worse, in hardware too. That is the reason for the inclusion of the s flag in the opcode (s = 0 means don't save flags) and for the decision to have instructions with s = 1 write the whole flags register instead of individual flags.
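A minimal Python sketch of the s bit behaviour described above, assuming a 16 bit data path and 6502-style N, V, Z and C flags. The bit positions and the function name add16 are invented for the example and are not taken from the actual core.

```python
MASK16 = 0xFFFF
SIGN16 = 0x8000

# Hypothetical bit positions inside the flags register.
FLAG_N, FLAG_V, FLAG_Z, FLAG_C = 0x8, 0x4, 0x2, 0x1


def add16(a, b, flags, s):
    """Add two 16 bit values and return (result, flags).

    s = 0: the flags register is left untouched.
    s = 1: the whole flags register is rewritten from this result,
           never a subset of individual flags.
    """
    total = a + b
    result = total & MASK16
    if not s:
        return result, flags
    new_flags = 0
    if result & SIGN16:
        new_flags |= FLAG_N
    if result == 0:
        new_flags |= FLAG_Z
    if total > MASK16:
        new_flags |= FLAG_C
    # Signed overflow: operands share a sign that differs from the result's sign.
    if (~(a ^ b) & (a ^ result)) & SIGN16:
        new_flags |= FLAG_V
    return result, new_flags


if __name__ == "__main__":
    print(add16(0xFFFF, 1, 0, 1))
    print(add16(0xFFFF, 1, FLAG_N, 0))
```

With only two cases (keep everything or replace everything) there is no per-flag write enable to route.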
I cannot shrink the micro-ops:
3 bits for source A
3 bits for source B
1 bit reserved
1 bit bypass using k16
4 bits dest (8 registers + WriteMAR_Width8, WriteMAR_Width16, Discard x 6)
4 bits alu function (ADD/ADC, SUB/SBC, INC, DEP, LSR/ROR, ASL/ROL, AND, ORA, EOR, LDA, EXT, BSW, NOP, NOP, NOP, NOP)
1 bit save result
1 bit carry mask
1 bit save flags
1 bit load
1 bit write
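To make the field list concrete, here is a small Python sketch that packs and unpacks a micro-op using the widths listed above. The field names, their order and their bit positions are assumptions made for the example; only the widths come from the list. Note that the widths as listed add up to 21 bits.

```python
# (name, width) pairs taken from the list above; order and names are assumed.
FIELDS = [
    ("src_a", 3),
    ("src_b", 3),
    ("reserved", 1),
    ("bypass_k16", 1),
    ("dest", 4),
    ("alu_fn", 4),
    ("save_result", 1),
    ("carry_mask", 1),
    ("save_flags", 1),
    ("load", 1),
    ("write", 1),
]


def total_width():
    """Number of bits needed for one micro-op with these fields."""
    return sum(width for _, width in FIELDS)


def pack(**values):
    """Pack named field values into one integer, first field in the low bits."""
    word = 0
    shift = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word):
    """Inverse of pack: split an integer back into a dict of fields."""
    fields = {}
    for name, width in FIELDS:
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


if __name__ == "__main__":
    uop = pack(src_a=5, src_b=2, dest=9, alu_fn=3, save_flags=1, write=1)
    print(total_width(), unpack(uop))
```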
I will respond to the remaining considerations tomorrow.
EDIT: grammar
Re: Personal 6502 "evolution" project
Update.
It has reached alpha stage. I still have to fix the decoder; the internal core seems to work. Fetch seems to work too.
Simulated with iverilog. Synthesis seems to be fine with yosys.
I discarded short instructions; they are infrequent and not worth the hassle. Poor decisions were fixed (e.g. reset to zero, but zero is a stable state).
Missing instructions: bsr, jsr.
Missing features: interrupts.
If you would like to dive into a sea of pain, you can simulate instruction sequences and report bugs.
Re: Personal 6502 "evolution" project
Good news.
I fixed another ton of bugs. Now this program
Code:
LDZ #0 ?
LDS #0
LDY #$64
LDA #$C000
STA $00A0
LDA #$B000
STA $00A2
SUB:Y #1 ?
LDA byte ($00A0), Y
STA byte ($00A2), Y
BNE -16
runs correctly in 812 cycles.
Re: Personal 6502 "evolution" project
I had this weird idea about implementing interrupts and jsr/bsr with the same circuitry.
After all, they look the same: push pc first and then jump to the address (address = pc + offset if bsr).
I hope it doesn't turn out too difficult to manage.
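A behavioural Python sketch of that idea, with one shared "push pc, then jump" primitive behind jsr, bsr and interrupt entry. The class name, the descending stack, the high-byte-first push order and the starting stack pointer are all assumptions for the example, and only pc is pushed because that is all the description above mentions.

```python
class Core:
    """Toy model: jsr, bsr and interrupts share one push-and-jump path."""

    def __init__(self, pc=0, sp=0x01FF):
        self.mem = bytearray(0x10000)
        self.pc = pc
        self.sp = sp  # assumed descending stack

    def _push16(self, value):
        # Assumed order: high byte first, then low byte.
        for byte in (value >> 8, value & 0xFF):
            self.mem[self.sp] = byte
            self.sp = (self.sp - 1) & 0xFFFF

    def _push_and_jump(self, target):
        # The shared circuitry: save the return address, then load pc.
        self._push16(self.pc)
        self.pc = target & 0xFFFF

    def jsr(self, address):
        self._push_and_jump(address)

    def bsr(self, offset):
        # Same path, but the target is relative to the current pc.
        self._push_and_jump(self.pc + offset)

    def interrupt(self, vector):
        self._push_and_jump(vector)


if __name__ == "__main__":
    core = Core(pc=0x1234)
    core.bsr(-4)
    print(hex(core.pc), hex(core.sp))
```

The three entry points differ only in where the target address comes from, which is what makes sharing the hardware plausible.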
Re: Personal 6502 "evolution" project
Ok, guys. Flags are hard. They are the worst part of the design. I made them addressable and I regret doing it.
I regret it so much that I will now remove addressable flags and add 2 special instructions to load and store flags. This will simplify the design a lot.
Re: Personal 6502 "evolution" project
It's good to be able to experiment, and to backtrack!
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)