6502.org Forum  Projects  Code  Documents  Tools  Forum
All times are UTC




PostPosted: Tue Jul 05, 2022 5:17 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10797
Location: England
Hmm, the 6502 list price at launch in single-unit quantities is one thing, the price Apple paid for the purposes of the Apple II another. Even so, a price of $15 for a 68k sounds pretty good - but not many purchasers could offer orders of quantity a million.

I suspect you might be underestimating the difficulty of boosting PC to 24 bits (and adding in a chunk of handling of 3-byte addresses). It might be informative, although I think not at all easy, to try to upgrade Arlet's simple 6502 HDL to this level. Not only would that give a measure at the level of HDL source complexity, but also, after synthesis, a measure at the level of gate count.

Actually, it would do more - it would give us a 65M202 to play with!


PostPosted: Tue Jul 05, 2022 6:35 am 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8432
Location: Southern California
jds wrote:
It could be argued that the 65816 does have a flat memory model. If the programmer uses the long addressing modes exclusively then the address space is flat. The two byte addressing modes can be considered as optimisations of that flat address space, just like the single byte branches and direct pages are also optimisations rather than limitations.

There is a small wrinkle with this way of thinking regarding the program counter not crossing bank boundaries, but this is not a major issue to deal with

I agree. Mensch did a good compromise.

Quote:
The biggest change with the 68000 was the massive jump in registers, both in count and in size. A general purpose successor to the 65816 would probably need to go down this path, as did the x86 line of processors, adding registers and increasing their size with each generation.

Mensch was asked in a conference last year, "What would a 32-bit 6502 look like today?" and he said, "ARM."

Note however that more and wider registers don't necessarily translate to higher performance. Sophie Wilson, chief architect of the ARM processor, said, "an 8MHz 32016 was completely trounced in performance terms by a 4MHz 6502." (The 32016 was National's 32-bit processor, having 15 registers, including 8 general-purpose 32-bit registers, and a 16-bit external data bus.) The 1802 processor with all its registers was a real dog performancewise. Jack Crenshaw, an embedded-systems engineer who wrote regularly in Embedded Systems Programming magazine said in the 9/98 issue that he still couldn't figure out why, in BASIC benchmark after benchmark, the 6502 could outperform the Z80 which had more and bigger registers, a seemingly more powerful instruction set, and ran at higher clock rates. (The 6502's zero page and improved indexed and indirect addressing modes no doubt helped.)

Quote:
Other performance improvements could be made having little impact on the instruction set, such as instruction pipelining, caching, virtual memory and increases in the size of the data bus.

There aren't very many dead bus cycles, and the 65CE02 eliminated almost all dead bus cycles; so without a total change in architecture, additional pipelining would not have made any improvement in performance. The 816's VDA and VPA were supposedly initially intended for filling separate data and instruction caches, according to the data sheet. I have no idea whether they were ever used for that.

Quote:
Personally I would like to use a design that used a prefix byte to specify additional registers and widths rather than the processor flags we currently have.

Have you done much programming with it? Writing my '816 (actually '802) Forth, I found that I very seldom had to change the M and X flags. They made it pretty efficient, compared to having to often add prefix bytes.

65LUN02 wrote:
That [68000] design then won every CPU competition in the next decade.

In some ways, yes; in others, no. The '816 compared favorably to the 68K in the Sieve benchmark, and the '02 absolutely blew the doors off the 68K in interrupt performance, something that is very important to my uses. It is not out of the question to have an off-the-shelf 65c02 service a million interrupts per second. It is my understanding however that Mot went to compiler designers and asked them what they would like in a processor, and then addressed their wishes in the 68K design, making a processor that was a better target for C (and other) compilers. It did seem like in the years around 1990, what I was gathering from the industry magazines (EDN, Electronic Design, Test and Measurement, Computer Design, etc.) was that Mot was always a little bit ahead of Intel, regarding the 68K family versus the x86 family. My perception was that Mot had an advantage in that they started with a clean slate whereas Intel was trying to protect the earlier software investments which meant that they had to add kludge upon kludge to upgrade the x86's.

Although I have a good memory, the fact is I was watching the developments only from a distance, so I don't consider myself any kind of authority in this stuff. My job was (and still is) mostly analog, and I use the little computers for controlling the analog. I never got into the big stuff. I like the 65 family because I don't have to be a computer engineer to design with it. It has an excellent power-to-complexity ratio.

Quote:
Apple, Commodore, Atari, Sun, HP, Apollo, Silicon Graphics. Except for the IBM PC (which would have gone to the 68000 too if only enough test chips were ready in 1979), the 68000 was the go-to chip of the 1980s, only in the 1990s to be replaced by the wave of RISC chips, as the 68000 turned out to be too difficult for Motorola to upgrade.

The industry magazines I was bombarded with at work in the late 1980's had a constant buzz about RISC. Computer Design is one I gave only very casual attention to before tossing them in the recycling; but here's the cover of one I kept, for reasons I don't remember, from Nov 1989:
Attachment: ComputerDesignRISCissueNov89.jpg [ 79.67 KiB ]

The benefit being touted was that with the simpler instruction set, although it would take more instructions to get the job done, the clock rate could be raised more than enough to compensate. The mantra was "higher performance at lower cost." It seemed however that that was quickly changing to "maximum performance at any cost," and I lost what little interest I had. Since then, the lines have been blurred, and CISC processors' instructions get broken down into RISC instructions internally before being executed. (Bill Mensch says the '02 is neither RISC nor CISC, and calls it "addressable-register architecture," or "ARA," as basically all of zero page is 256 bytes of processor registers.) My only up-close experience with RISC is with the PIC16 microcontrollers which I've put in a dozen products. It may hardly qualify as a RISC (in spite of what Microchip said), but it did have separate instruction and data buses, something that made it more difficult to work with, even though the performance was not as good as that of the 65c02 which had the older Von Neumann architecture. I can see the attraction to have a compiler that hides the Harvard architecture complexities from the programmer. There's no need for that with the 65xx.

Edit: I just now read several articles in the above Computer Design magazine issue. Maybe I'll later post a scan of the table of contents and one of the articles.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Wed Jul 06, 2022 4:48 am 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8432
Location: Southern California
GARTHWILSON wrote:
Edit: I just now read several articles in the above Computer Design magazine issue. Maybe I'll later post a scan of the table of contents and one of the articles.

It's a supplemental issue. It looked like one, but now it's confirmed here. Here's the table of contents:
Attachment: CompDesRISCissueNov89contents.gif [ 174.33 KiB ]



PostPosted: Wed Jul 06, 2022 2:18 pm 
Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
RISC vs CISC was the big story in personal computers, workstations, and the remains of minicomputers in the mid-to-late-1980s.

I was working on my BS in Computer Science during this period. The Sun, HP, and other workstations on campus at Carnegie Mellon were all 680x0 in 1987. By 1991 the new Sun workstations were running on Sparc, and in the Computer Engineering class in 1990 the big project was implementing a subset of a Sparc in Verilog.

I was at IBM in 1991 and 1992 when the IBM/Motorola/Apple partnership launched the PowerPC chip set. Motorola had already developed their 88000 by then, but gave it up as part of that partnership. The Power architecture was IBM's RISC architecture.

I was running a software company focused on pen computers and PDAs in 1993 when the Apple Newton launched. It was one of the first ARM-based handhelds. The others were the WindowsCE handhelds that came out about the same time. The EO handhelds from 1992 had instead used the AT&T Hobbit chipset, which was RISC-esque, but optimized for running compiled C. Palm was the only manufacturer to sell any units at scale, but they were based on the 68328, a low-power 68000 derivative. https://en.wikipedia.org/wiki/Freescale_DragonBall

What I remember of that era was the religious-like arguments over the various architectures. None were clearly better than others.

PowerPC had enough traction from Apple and enough push from IBM that there was a WindowsNT version released, but eventually neither IBM nor Motorola made that architecture competitive. Sun sold enough workstations and early-web rack servers to pay for iterations on the Sparc, but that was killed when Oracle purchased Sun. HP didn't seem to have enough sales to keep up their PA-RISC architecture. MIPS turned into RISC-V, which still exists, but is niche.

ARM ended up the big winner, but looking at the latest ARM instruction set, it feels nearly as complex as the 680x0. So it seems that all sides of the debate were wrong, that what the market actually wanted wasn't the pure simplicity of RISC nor the massive set of addressing modes of CISC, but something in between.

Swinging back to the 6502, I think that is why that chip still gets chosen. It's not RISC, but it is simple. It's not CISC, but has sufficient addressing modes to get the jobs done. It never did get a multiplier, but code for multiplying is small and fast enough.


PostPosted: Sun Jul 10, 2022 7:26 pm 
Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
BigEd wrote:
I suspect you might be underestimating the difficulty of boosting PC to 24 bits (and adding in a chunk of handling of 3-byte addresses). It might be informative, although I think not at all easy, to try to upgrade Arlet's simple 6502 HDL to this level. Not only would that give a measure at the level of HDL source complexity, but also, after synthesis, a measure at the level of gate count.

https://github.com/lunarmobiscuit/verilog-65C2402-fsm proves otherwise. That is a working implementation of the 65C2402 as envisioned in the first post on this thread, with all the ABS[,X|Y] opcodes supporting 3-byte addresses, plus JMP (ABS[,X]) and RTS and RTI.

Total clock time was approximately 10 hours, with half of that fighting with Verilog as I've not touched the language since 1999, plus I found one bug in Arlet's code in https://github.com/Arlet/verilog-65C02-fsm that he fixed in 2 minutes, but which cost me a few hours before I realized it didn't work.

I don't know how to synthesize Verilog into CMOS to compare gate counts, but comparing Arlet's 65C02 and my 65C2402 in terms of lines of code and size of the netlist:
Code:
wc -l ab.v alu.v regfile.v cpu.v
 65C02  65C2402
    94      120    ab.v
    91       93    alu.v
    59       74    regfile.v
   684      757    cpu.v
   928     1044    TOTAL

iverilog -S -Nnetlist; wc -l netlist
  1593     1775    netlist

All these changes thus came to just 116 more lines of code (at least 16 of which are new comments or commented-out debugging code that I left in). So approx 11% more lines of actual code. In terms of netlist (which includes no comments) my design has 182 more items, which again is approx 11% larger. That feels about right given the amount of effort and extent of the changes.
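As a quick check of the percentages quoted above, here is a small sketch that recomputes them from the numbers in the table. The variable and function names are just illustrative; only the counts come from the post. The raw line growth comes out at 12.5%, and it drops to roughly 11% once the 16 comment/debug lines are excluded.

```python
# Line counts copied from the wc -l table in the post above.
LINES_65C02 = {"ab.v": 94, "alu.v": 91, "regfile.v": 59, "cpu.v": 684}
LINES_65C2402 = {"ab.v": 120, "alu.v": 93, "regfile.v": 74, "cpu.v": 757}
NETLIST_65C02, NETLIST_65C2402 = 1593, 1775
COMMENT_LINES = 16  # the post's "at least 16" comment / debug lines


def growth(old: int, new: int) -> float:
    """Relative growth of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


old_total = sum(LINES_65C02.values())
new_total = sum(LINES_65C2402.values())
added = new_total - old_total

raw_pct = growth(old_total, new_total)                   # all added lines
code_pct = growth(old_total, new_total - COMMENT_LINES)  # excluding comments
netlist_pct = growth(NETLIST_65C02, NETLIST_65C2402)

print(f"added lines: {added}")
print(f"line growth: {raw_pct:.1f}% raw, {code_pct:.1f}% without comments")
print(f"netlist growth: {netlist_pct:.1f}%")
```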

From the README:

## Design goals
The main design goal is to show the possibility of a backwards-compatible 65C02 with a 24-bit
address bus, with no modes, no new flags, just two new op-codes: CPU and A24

$0F: CPU isn't necessary, but fills the A register with #$10, matching the prefix code
$1F: A24 does nothing by itself. Like in the Z80, it's a prefix code that modifies the
subsequent opcode.

When prefixed all ABS / ABS,X / ABS,Y / IND, and IND,X opcodes take a three byte address in the
subsequent three bytes. E.g. $1F $AD $EF $78 $56 = LDA $5678EF.

Opcode A24 before a JMP or JSR changes those opcodes to use three bytes to specify the address,
with this 24-bit version of JSR pushing three bytes onto the stack: low, high, 3rd. The matching
24-bit RTS ($1F $60) pops three bytes off the stack low, high, and 3rd.

RTI always pops four bytes: low, high, and 3rd for the IR, then 1 byte for the flags
(But Arlet's code doesn't support IRQ or NMI, so this CPU never pushes those bytes)

The IRQ, RST, and NMI vectors are $FFFFF7/8/9, $FFFFFA/B/C, and $FFFFFD/E/F.

Without the prefix code, all opcodes are identical to the 65C02. Zero page is unchanged.
ABS and IND addressing are all two bytes. Historic code using JSR/RTS will use 2-byte/16-bit
addresses.

The only non-backward-compatible behaviors are the new interrupt vectors. A new RST handler
could simply JMP ($FFFC), presuming a copy of the historic ROM was addressable in page $FF.
A new IRQ handler could similarly JMP ($FFFE). The only issue would be legacy interrupt handlers
that assumed the return address was the top two bytes on the stack, rather than three.

## Changes from the original

PC (the program counter) is extended from 16-bits to 24-bits
AB (the address bus) is extended from 16-bits to 24-bits
D3 (a new data register) is added to allow loading three-byte addresses

One new decode line is added for pushing the third byte for the long JSR

A handful of new states were added to the finite state machine that processes the opcodes, in general
just one new state for handling ABS addresses, three-byte JMP/JSR, and three-byte RTS/RTI
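To make the prefix idea concrete, here is a minimal Python sketch of how an ABS-mode operand could be decoded with and without the A24 prefix. It is not taken from the Verilog; `decode_abs_operand` is a hypothetical helper that only handles the address bytes, assuming the little-endian order shown in the README example ($1F $AD $EF $78 $56 = LDA $5678EF).

```python
from typing import Tuple

A24_PREFIX = 0x1F  # prefix opcode from the README
LDA_ABS = 0xAD     # standard 6502 LDA absolute


def decode_abs_operand(code: bytes, pc: int = 0) -> Tuple[int, int]:
    """Return (effective_address, instruction_length) for an ABS-mode
    instruction starting at code[pc].

    Hypothetical helper: it assumes the byte at pc (or after the prefix)
    is an ABS-mode opcode and does not validate it.
    """
    if code[pc] == A24_PREFIX:
        # prefix, opcode, then low / high / 3rd address bytes
        lo, hi, third = code[pc + 2], code[pc + 3], code[pc + 4]
        return lo | (hi << 8) | (third << 16), 5
    # unprefixed: opcode, then low / high address bytes, as on the 65C02
    lo, hi = code[pc + 1], code[pc + 2]
    return lo | (hi << 8), 3


addr, length = decode_abs_operand(bytes([A24_PREFIX, LDA_ABS, 0xEF, 0x78, 0x56]))
print(f"LDA ${addr:06X} ({length} bytes)")
```

Without the prefix the same opcode bytes decode as an ordinary 3-byte instruction with a 16-bit address, which is the backward-compatibility point the README is making.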

------------

There isn't any assembler, so all the test code in ram.hex was hand-assembled. I read that Woz coded up the 6502 two characters at a time, as he had memorized all the opcodes and addressing modes. My 10 hour estimate includes doing the same as well as running the code and checking it not just opcode by opcode but state by state to make sure it works.

Try it out yourself and let me know what you think. The README provides instructions for running the testbed.

Thank you Arlet for the code to build upon as well as the 12-hour tech support turnaround.


PostPosted: Sun Jul 10, 2022 7:55 pm 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10797
Location: England
Splendid! I don't think I was aware of this latest adventure of Arlet's - in the thread "My new verilog 65C02 core." we're seeing the predecessor project, the one I'm more familiar with. (verilog-65C02-fsm starts Jan 2022, whereas verilog-65C02-microcode (previously verilog-65C02) got the most recent substantial update in Nov 2020 (and some coding tweaks since).)


PostPosted: Sun Jul 10, 2022 8:01 pm 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8432
Location: Southern California
65LUN02 wrote:
There isn't any assembler, so all the test code in ram.hex was hand-assembled.

You might like to know about Cross-32 (C32 for short) which I use, formerly from Universal Cross Assemblers, now sold by both Data Sync Engineering and MPE, available at https://www.datasynceng.com/c32doc.htm and https://www.mpeforth.com/cross32.htm . It's an excellent macro assembler that assembles for lots of different processors, and they also give you the information to adapt it to a processor of your own design. Supposedly it's about a 40-hour job to do that.

(Others here have heard me say this, but you joined less than two weeks ago.)



PostPosted: Mon Jul 11, 2022 4:45 am 
Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Thanks Garth, but I'll pass on customizing that or any other assembler until someone helps finish Arlet's design, synthesizes it into an FPGA (or a handful of CPLDs), and builds the first 6502 bench computer that has a flat 1MB+ address space.

I've not touched an FPGA since 1999 and have no interest in dealing with many dozens of hours of debugging to get from Verilog to working silicon.


PostPosted: Mon Jul 11, 2022 5:43 am 
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Nice job. Glad I could help.

I had started the 6502 FSM project thinking I could make a readable and small/fast design, but got disappointed at the Xilinx tool chain not understanding how to optimize the synthesis properly, and hand-optimizing the code would make it much harder to understand. It's a shame that Xilinx basically dropped support for their simpler FPGAs.

At least I'm glad it helped you with this project.


PostPosted: Thu Jul 14, 2022 12:50 am 
Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Back in my original post on this thread, beyond 24-bit addressing, I asked the question about multi-threading:

65LUN02 wrote:
Multi-threading
If the 6502 were to grow up incrementally as the main CPU family for the Mac, it would inevitably be looked at by the Unix workstation companies too, just as Sun followed the crowd to the 68000 before jumping on the RISC bandwagon with SPARC. Context switching with 32 registers is a very expensive operation. The one advantage of just A/X/Y/S/SP/PC is less context and that advantage only goes up when that context is all on-chip instead of having to be copied to/from the stack.

Meet the https://github.com/lunarmobiscuit/verilog-65C24T8-fsm, the upgrade to the https://github.com/lunarmobiscuit/verilog-65C02-fsm that supports 8 threads. Fully backward compatible to the 65C2402 and 65C02. From the README:

The 6502 wasn't commonly used for multi-tasking operating systems, but in the alternative history of 24-bit addresses, a possible next iteration would be to add a few threads, useful for either speeding up interrupt handling or for running multiple processes at once.

The key change is replicating the registers, and not just A/X/Y but also S (the stack register), PC (the program counter), and P (the processor flags). One of each takes up less than 10% of the die space of the CPU. Adding seven copies of each adds a lot of transistors between the registers and the muxes plus the extra logic for a few more opcodes.
This implementation adds eight and a half new opcodes.

$03: THR, switch to another thread, where the thread number is stored in Y
$13: THW, similar to WAI, except it just halts the thread by switching to thread #0, which is presumed to be the interrupt handler and scheduler
$23: THY, copy the current thread's index value into Y
$33: THI #89ABCD, set the PC of thread Y to the immediate (24-bit) address
$43: TTA, copy register A[Y] to A (where A is the A register of the current thread)
$53: TAT, copy A to register A[Y]
$63: TTS, copy the stack register from thread Y to X
$73: TST, copy X to the stack register of thread Y
$F3: _T_, NOP, which technically isn't needed, but makes it easier to see when a thread changes in the debug log
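The opcode list above can be read as operations on per-thread register banks selected by Y. Here is a toy Python model of that reading. It is a sketch only, not a translation of the Verilog: the class names, the reset values, and the wrap-around of an out-of-range Y are all my assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreadRegs:
    """One thread's private copy of the registers."""
    a: int = 0
    x: int = 0
    y: int = 0
    s: int = 0xFF
    p: int = 0
    pc: int = 0


@dataclass
class ToyCpu:
    n_threads: int = 8
    t: int = 0  # index of the thread currently running
    regs: List[ThreadRegs] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.regs:
            self.regs = [ThreadRegs() for _ in range(self.n_threads)]

    @property
    def cur(self) -> ThreadRegs:
        return self.regs[self.t]

    def _target(self) -> int:
        # Assumption: an out-of-range Y simply wraps; the post does not say.
        return self.cur.y % self.n_threads

    def thr(self) -> None:            # $03: switch to thread Y
        self.t = self._target()

    def thw(self) -> None:            # $13: yield to thread 0
        self.t = 0

    def thy(self) -> None:            # $23: current thread index -> Y
        self.cur.y = self.t

    def thi(self, addr: int) -> None: # $33: set PC of thread Y
        self.regs[self._target()].pc = addr & 0xFFFFFF

    def tta(self) -> None:            # $43: A[Y] -> A
        self.cur.a = self.regs[self._target()].a

    def tat(self) -> None:            # $53: A -> A[Y]
        self.regs[self._target()].a = self.cur.a

    def tts(self) -> None:            # $63: S[Y] -> X
        self.cur.x = self.regs[self._target()].s

    def tst(self) -> None:            # $73: X -> S[Y]
        self.regs[self._target()].s = self.cur.x


# Set the entry point of threads 7..1, one THI per value of Y.
cpu = ToyCpu()
for y in range(7, 0, -1):
    cpu.cur.y = y
    cpu.thi(0x01FD10)
print([hex(r.pc) for r in cpu.regs])
```

The usage loop mirrors a LDY #7 / THI / DEY / BNE sequence run from thread 0, so thread 0's own PC is left alone while the other seven get an entry point.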

The Y index register was used to specify the thread being run or modified so that code need not be hardwired with immediates if there were different versions of the CPU with different numbers of threads. The CPU instruction from the 65C2402 has been modified to return (in the A register) the number of threads in the least significant nibble. That thus supports up to 15 threads.

The threads each have their own stack. Thread 0's is at the traditional range of $0100-$01FF. Each thread is $0100 higher in memory. Thus thread 7 uses $0800-$08FF.
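The stack layout described above boils down to one multiplication. A minimal sketch, with hypothetical function names, assuming an 8-bit S per thread as on the 65C02:

```python
from typing import Tuple


def stack_range(thread: int) -> Tuple[int, int]:
    """First and last address of a thread's 256-byte stack page."""
    base = 0x0100 * (thread + 1)
    return base, base + 0xFF


def stack_address(thread: int, s: int) -> int:
    """Address the stack pointer S refers to for the given thread."""
    return 0x0100 * (thread + 1) + (s & 0xFF)


for t in (0, 7):
    lo, hi = stack_range(t)
    print(f"thread {t}: ${lo:04X}-${hi:04X}")
```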

The TTS and TST instructions let one thread push items onto other threads' stacks, but only through STA operations and quite a few instructions. There could be push and pull opcodes added that use Y to specify which stack to push and pull from. That might be useful for single-threaded implementations of Forth and C, which generally use multiple stacks.

As Arlet's 65C02 does not include interrupts, I've not modified that logic. There are a few ways interrupts can be improved with this design. The best way is to assume thread 0 is the interrupt handler. Upon an IRQ or NMI, the CPU could set a flag in thread 0's processor flags and after the current opcode finishes, switch to thread 0. This can be fast as thread switching can be done concurrently with looking up the IRQ/NMI vector. Plus the A, X, and Y registers are all set with whatever values the interrupt handler last left them with, and the stack has no entries from any of the other threads.

To make it even faster, the CPU could look up those vectors upon RST, and if they are all zeros, not look up the vectors again until another RST and instead just switch to thread 0 upon an IRQ or NMI (or BRK). For backwards compatibility, the standard IRQ/NMI behavior of pushing the RTI address and processor flags can continue, without switching the current thread.

For multi-tasking operating systems, these new opcodes are sufficient with no extra hardware for cooperative multitasking. The scheduler runs in thread 0. The other threads use THW to yield to the scheduler. The original Macintosh OS was like this, with a yield() system call. For Unix-like preemptive multitasking, an external clock generates an interrupt, and the interrupt handler and scheduler work together to pick the next thread to run.
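The cooperative case can be illustrated in software with Python generators, where `yield` plays the role of THW (hand control back to thread 0) and `next()` plays the role of THR (resume a thread). This is only an analogy for the control flow, under the assumption of a simple round-robin policy; nothing here models the actual CPU.

```python
from collections import deque
from typing import Deque, Generator, Iterable, List

Task = Generator[None, None, None]


def worker(name: str, steps: int, log: List[str]) -> Task:
    """A task that does `steps` units of work, yielding after each one."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # stands in for THW: give control back to the scheduler


def run_scheduler(tasks: Iterable[Task]) -> None:
    """Round-robin scheduler, playing the role of thread 0."""
    ready: Deque[Task] = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # stands in for THR: run until the task yields
            ready.append(task)  # still alive, so queue it again
        except StopIteration:
            pass                # task finished; drop it


log: List[str] = []
run_scheduler([worker("A", 2, log), worker("B", 1, log)])
print(log)
```

A preemptive version would differ only in who triggers the switch: a timer interrupt rather than the task itself.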

---

In terms of complexity, adding this took less than 100 extra lines of code (Arlet's whole design is around 900 lines of code). The netlist grew by more than 20%, but that seems a little low given eight threads adds 452 more bits of registers to store. At 6 transistors per bit (that is the cost per bit for SRAM), that is 2712 more transistors. So maybe back in 1980 adding ~50% more transistors wouldn't be worth the effort, but there is nothing special about eight threads. Four threads would take only 226 more register bits and only 1356 more transistors.
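The transistor estimate is simple enough to recompute. The sketch below uses only the figures quoted in this thread (452 and 226 extra register bits, 6 transistors per bit, and the roughly 5,000-transistor estimate for the 65C02 mentioned a few posts later); the names are illustrative, and the 5,000 baseline in particular is an estimate, not a measured count.

```python
TRANSISTORS_PER_BIT = 6          # 6T SRAM cell, as assumed in the post
EXTRA_BITS_8_THREADS = 452       # figure quoted in the post
EXTRA_BITS_4_THREADS = 226       # figure quoted in the post
BASELINE_65C02_ESTIMATE = 5000   # rough estimate quoted later in the thread


def extra_transistors(bits: int) -> int:
    """Transistors needed to store `bits` more register bits."""
    return bits * TRANSISTORS_PER_BIT


eight = extra_transistors(EXTRA_BITS_8_THREADS)
four = extra_transistors(EXTRA_BITS_4_THREADS)
share = 100.0 * eight / BASELINE_65C02_ESTIMATE

print(f"8 threads: {eight} transistors ({share:.0f}% of the baseline estimate)")
print(f"4 threads: {four} transistors")
```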

@Garth, you like fast interrupts. The benefit of this design is that IRQ/NMI can be just a context switch, skipping even the three cycles to load the IRQ/NMI vector, plus the cycles to push the PC and flags. Optimize the decode logic and that could be just 1 cycle to switch thread and 1 cycle to load the first opcode from the interrupt handler.


PostPosted: Thu Jul 14, 2022 7:33 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10797
Location: England
Well done for thinking the idea through and writing the corresponding HDL!

Just a few comments:

> As Arlet's 65C02 does not include interrupts

This is true at present for his latest experiments, but his earlier 6502 and 65C02 were complete (and correct!). I think it's very important we recognise the history and contribution here - his first 6502 implementation here was a great leap forward. We shouldn't confuse an unfinished experiment with that.

> The CPU instruction from the 65C2402 has been modified to return (in the A register) the number of threads in the least significant nibble. That thus supports up to 15 threads.

In a situation like this, I'd be strongly inclined to make 0 mean 16.
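A sketch of what that encoding would look like in software, with hypothetical helper names: a 4-bit field can then describe 1 to 16 threads instead of 0 to 15, since a CPU with zero threads is not a useful thing to report.

```python
def encode_thread_count(n: int) -> int:
    """Pack a thread count of 1..16 into a nibble, with 16 stored as 0."""
    if not 1 <= n <= 16:
        raise ValueError("thread count must be in 1..16")
    return n & 0x0F


def decode_thread_count(a: int) -> int:
    """Read the thread count from the low nibble of A; 0 means 16."""
    nibble = a & 0x0F
    return 16 if nibble == 0 else nibble


for n in (1, 8, 16):
    print(n, "->", encode_thread_count(n), "->", decode_thread_count(encode_thread_count(n)))
```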

> The key change is replicating the registers, and not just A/X/Y but also S (the stack register), PC (the program counter), and P (the processor flags). One of each takes up less than 10% of the die space of the CPU. Adding seven copies of each adds a lot of transistors between the registers and the muxes plus the extra logic for a few more opcodes

How this kind of architectural proposal would play out depends somewhat on the implementation technology and techniques. In most technologies, beyond a certain width a mux will be replaced by a tristate bus. Inside FPGA, these days, we're not offered tristate busses, but we can use wired-OR for a similar benefit.

One great leap forward in Arlet's first 6502 implementation was to use 'distributed RAM' for A, X, Y, and S. In effect that's a small 4x8bit register file, with an efficient on-FPGA implementation. Using a register file becomes a normal tactic, above a certain size - even the Z80 has a register file. One can suppose that any given technology provides a reasonably efficient register file.

But register files are characterised by how many ports they have, and some instruction set choices will demand more ports than are efficient to implement, so that becomes another concern.


PostPosted: Thu Jul 14, 2022 10:18 am 
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Yes, I'm afraid the current implementation introduces extra ports in the memory with all the new transfer instructions. That's perfectly fine of course, if you're still in the design phase and just want to try things out in simulation.

If you want to optimize performance in hardware, it's best to make these access ports explicit in the regfile interface, so you can see exactly how many you need, and maybe figure out a way to squeeze multiple accesses through the same port. Especially dual write ports are expensive, resource-wise. If you need multiple ports, the best is to try to have a single writer, and N readers.

Alternatively, you could have a simpler implementation where you don't use extra ports, but limit the scope of the new transfer instructions. For instance, you could make some registers shared between all the tasks, or allow one task to access a register from the other task with an opcode prefix.

Another thing is that you can replace the multiple register files with one big one, and use the thread id as the most significant bits.
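As an illustration of the single-register-file idea, here is a small Python sketch of the indexing: the thread id forms the upper address bits and a 2-bit register select forms the lower bits. The register order (A, X, Y, S) and the class name are assumptions for the example, not the layout of the actual regfile.v.

```python
from typing import List

REG_NAMES = ("A", "X", "Y", "S")  # assumed order; 2 select bits
REG_SELECT_BITS = 2


def regfile_index(thread: int, reg: str) -> int:
    """Index into one flat register file: thread id in the high bits."""
    return (thread << REG_SELECT_BITS) | REG_NAMES.index(reg)


class FlatRegFile:
    """One 8-bit-wide memory holding A/X/Y/S for every thread."""

    def __init__(self, n_threads: int = 8) -> None:
        self.mem: List[int] = [0] * (n_threads << REG_SELECT_BITS)

    def read(self, thread: int, reg: str) -> int:
        return self.mem[regfile_index(thread, reg)]

    def write(self, thread: int, reg: str, value: int) -> None:
        self.mem[regfile_index(thread, reg)] = value & 0xFF


rf = FlatRegFile()
rf.write(3, "Y", 0x42)
print(regfile_index(3, "Y"), hex(rf.read(3, "Y")))
```

With eight threads this is a 32x8 memory, and an ordinary instruction only ever touches the four entries of the current thread, so the port count stays the same as in the single-threaded design.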

There's also a trade off that can be made with the flags register and PC. If you make the task switch first push P/PC on the stack, switch SP, and then pop P/PC from the new stack, you don't need extra registers/muxes, at the cost of 6 extra cycles for a task switch.


PostPosted: Thu Jul 14, 2022 3:58 pm 
Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Arlet, an efficient implementation would add src_thr and dst_thr inputs to the alu and regfile with matching decode/control lines, and other inputs to ab so that the PC calculations could go back in there. I just wanted to see the idea running code, so I brute forced the logic.

Ultimately, the additional number of gates in a real implementation is driven far more by the number of copies of the registers. Each set of A/X/Y/S/P/PC is 8+8+8+7+24 = 55 bits, plus 2-3 more bits for a T register to hold the current thread index.

Eight sets of registers require roughly the same number of transistors as the original NMOS 6502. I've not seen an accurate count for the CMOS 65C02, but I've seen it estimated at 5,000 transistors. So eight sets of registers is roughly 50% additional transistors.

The fun for me was seeing a simulation like this:
Code:
   0 R           PC:xxxxxx AB:xxxxxx W:x DI:xx DR:xx D3:xx DO:xx IR:ea WE:0 ALU:xx S:xx A:xx X:xx Y:xx P:------ T:x
   1 -      BRK4 PC:fffffa AB:fffffa W:1 DI:xx DR:xx D3:xx DO:xx IR:xx WE:0 ALU:xx S:ff A:40 X:01 Y:02 P:------ T:0
   2 -      BRK3 PC:fffffb AB:fffffb W:1 DI:00 DR:xx D3:xx DO:xx IR:00 WE:0 ALU:xx S:ff A:40 X:01 Y:02 P:---I-- T:0
   3 -      JMP0 PC:fffffc AB:fffffc W:1 DI:fd DR:fd D3:00 DO:xx IR:fd WE:0 ALU:xx S:ff A:40 X:01 Y:02 P:---I-- T:0
   4 -      JMP1 PC:fffffd AB:01fd00 W:1 DI:01 DR:fd D3:00 DO:xx IR:01 WE:0 ALU:xx S:ff A:40 X:01 Y:02 P:---I-- T:0
   5 -      SYNC PC:01fd01 AB:01fd01 W:0 DI:a0 DR:fd D3:00 DO:xx IR:a0 WE:0 ALU:xx S:ff A:40 X:01 Y:02 P:---I-- T:0
   6 - LDY  IMM0 PC:01fd02 AB:01fd02 W:0 DI:07 DR:fd D3:00 DO:xx IR:07 WE:0 ALU:fd S:ff A:40 X:01 Y:02 P:---I-- T:0
   7 - LDY  SYNC PC:01fd03 AB:01fd03 W:0 DI:33 DR:07 D3:fd DO:xx IR:33 WE:0 ALU:07 S:ff A:40 X:01 Y:02 P:---I-- T:0
   8 - THI  THP2 PC:01fd04 AB:01fd04 W:0 DI:10 DR:07 D3:fd DO:xx IR:10 WE:0 ALU:07 S:ff A:40 X:01 Y:07 P:---I-- T:0
   9 - THI  THP1 PC:01fd05 AB:01fd05 W:0 DI:fd DR:fd D3:10 DO:xx IR:fd WE:0 ALU:07 S:ff A:40 X:01 Y:07 P:---I-- T:0
  10 - THI  THP0 PC:01fd06 AB:01fd06 W:0 DI:01 DR:fd D3:10 DO:xx IR:01 WE:0 ALU:07 S:ff A:40 X:01 Y:07 P:---I-- T:0
  11 - THI  SYNC PC:01fd07 AB:01fd07 W:0 DI:88 DR:fd D3:10 DO:xx IR:88 WE:0 ALU:07 S:ff A:40 X:01 Y:07 P:---I-- T:0
  12 - DEY  SYNC PC:01fd08 AB:01fd08 W:0 DI:d0 DR:fd D3:10 DO:xx IR:d0 WE:0 ALU:06 S:ff A:40 X:01 Y:07 P:---I-- T:0
  13 - BNE  BRA0 PC:01fd09 AB:01fd02 W:0 DI:f9 DR:fd D3:10 DO:xx IR:f9 WE:0 ALU:06 S:ff A:40 X:01 Y:06 P:---I-- T:0
  14 - BNE  SYNC PC:01fd03 AB:01fd03 W:0 DI:33 DR:fd D3:10 DO:xx IR:33 WE:0 ALU:06 S:ff A:40 X:01 Y:06 P:---I-- T:0
  15 - THI  THP2 PC:01fd04 AB:01fd04 W:0 DI:10 DR:10 D3:fd DO:xx IR:10 WE:0 ALU:06 S:ff A:40 X:01 Y:06 P:---I-- T:0
  16 - THI  THP1 PC:01fd05 AB:01fd05 W:0 DI:fd DR:10 D3:fd DO:xx IR:fd WE:0 ALU:06 S:ff A:40 X:01 Y:06 P:---I-- T:0
  17 - THI  THP0 PC:01fd06 AB:01fd06 W:0 DI:01 DR:fd D3:10 DO:xx IR:01 WE:0 ALU:06 S:ff A:40 X:01 Y:06 P:---I-- T:0
  18 - THI  SYNC PC:01fd07 AB:01fd07 W:0 DI:88 DR:fd D3:10 DO:xx IR:88 WE:0 ALU:06 S:ff A:40 X:01 Y:06 P:---I-- T:0
  19 - DEY  SYNC PC:01fd08 AB:01fd08 W:0 DI:d0 DR:fd D3:10 DO:xx IR:d0 WE:0 ALU:05 S:ff A:40 X:01 Y:06 P:---I-- T:0
  20 - BNE  BRA0 PC:01fd09 AB:01fd02 W:0 DI:f9 DR:fd D3:10 DO:xx IR:f9 WE:0 ALU:05 S:ff A:40 X:01 Y:05 P:---I-- T:0
  21 - BNE  SYNC PC:01fd03 AB:01fd03 W:0 DI:33 DR:fd D3:10 DO:xx IR:33 WE:0 ALU:05 S:ff A:40 X:01 Y:05 P:---I-- T:0
  22 - THI  THP2 PC:01fd04 AB:01fd04 W:0 DI:10 DR:fd D3:10 DO:xx IR:10 WE:0 ALU:05 S:ff A:40 X:01 Y:05 P:---I-- T:0
  23 - THI  THP1 PC:01fd05 AB:01fd05 W:0 DI:fd DR:fd D3:10 DO:xx IR:fd WE:0 ALU:05 S:ff A:40 X:01 Y:05 P:---I-- T:0
  24 - THI  THP0 PC:01fd06 AB:01fd06 W:0 DI:01 DR:fd D3:10 DO:xx IR:01 WE:0 ALU:05 S:ff A:40 X:01 Y:05 P:---I-- T:0
  25 - THI  SYNC PC:01fd07 AB:01fd07 W:0 DI:88 DR:fd D3:10 DO:xx IR:88 WE:0 ALU:05 S:ff A:40 X:01 Y:05 P:---I-- T:0
  26 - DEY  SYNC PC:01fd08 AB:01fd08 W:0 DI:d0 DR:fd D3:10 DO:xx IR:d0 WE:0 ALU:04 S:ff A:40 X:01 Y:05 P:---I-- T:0
  27 - BNE  BRA0 PC:01fd09 AB:01fd02 W:0 DI:f9 DR:fd D3:10 DO:xx IR:f9 WE:0 ALU:04 S:ff A:40 X:01 Y:04 P:---I-- T:0
  28 - BNE  SYNC PC:01fd03 AB:01fd03 W:0 DI:33 DR:fd D3:10 DO:xx IR:33 WE:0 ALU:04 S:ff A:40 X:01 Y:04 P:---I-- T:0
  29 - THI  THP2 PC:01fd04 AB:01fd04 W:0 DI:10 DR:10 D3:fd DO:xx IR:10 WE:0 ALU:04 S:ff A:40 X:01 Y:04 P:---I-- T:0
  30 - THI  THP1 PC:01fd05 AB:01fd05 W:0 DI:fd DR:10 D3:fd DO:xx IR:fd WE:0 ALU:04 S:ff A:40 X:01 Y:04 P:---I-- T:0
  31 - THI  THP0 PC:01fd06 AB:01fd06 W:0 DI:01 DR:fd D3:10 DO:xx IR:01 WE:0 ALU:04 S:ff A:40 X:01 Y:04 P:---I-- T:0
  32 - THI  SYNC PC:01fd07 AB:01fd07 W:0 DI:88 DR:fd D3:10 DO:xx IR:88 WE:0 ALU:04 S:ff A:40 X:01 Y:04 P:---I-- T:0
  33 - DEY  SYNC PC:01fd08 AB:01fd08 W:0 DI:d0 DR:fd D3:10 DO:xx IR:d0 WE:0 ALU:03 S:ff A:40 X:01 Y:04 P:---I-- T:0
  34 - BNE  BRA0 PC:01fd09 AB:01fd02 W:0 DI:f9 DR:fd D3:10 DO:xx IR:f9 WE:0 ALU:03 S:ff A:40 X:01 Y:03 P:---I-- T:0
  35 - BNE  SYNC PC:01fd03 AB:01fd03 W:0 DI:33 DR:fd D3:10 DO:xx IR:33 WE:0 ALU:03 S:ff A:40 X:01 Y:03 P:---I-- T:0
  36 - THI  THP2 PC:01fd04 AB:01fd04 W:0 DI:10 DR:fd D3:10 DO:xx IR:10 WE:0 ALU:03 S:ff A:40 X:01 Y:03 P:---I-- T:0
  37 - THI  THP1 PC:01fd05 AB:01fd05 W:0 DI:fd DR:fd D3:10 DO:xx IR:fd WE:0 ALU:03 S:ff A:40 X:01 Y:03 P:---I-- T:0
  38 - THI  THP0 PC:01fd06 AB:01fd06 W:0 DI:01 DR:fd D3:10 DO:xx IR:01 WE:0 ALU:03 S:ff A:40 X:01 Y:03 P:---I-- T:0
  39 - THI  SYNC PC:01fd07 AB:01fd07 W:0 DI:88 DR:fd D3:10 DO:xx IR:88 WE:0 ALU:03 S:ff A:40 X:01 Y:03 P:---I-- T:0
  40 - DEY  SYNC PC:01fd08 AB:01fd08 W:0 DI:d0 DR:fd D3:10 DO:xx IR:d0 WE:0 ALU:02 S:ff A:40 X:01 Y:03 P:---I-- T:0
  41 - BNE  BRA0 PC:01fd09 AB:01fd02 W:0 DI:f9 DR:fd D3:10 DO:xx IR:f9 WE:0 ALU:02 S:ff A:40 X:01 Y:02 P:---I-- T:0
  42 - BNE  SYNC PC:01fd03 AB:01fd03 W:0 DI:33 DR:fd D3:10 DO:xx IR:33 WE:0 ALU:02 S:ff A:40 X:01 Y:02 P:---I-- T:0
  43 - THI  THP2 PC:01fd04 AB:01fd04 W:0 DI:10 DR:10 D3:fd DO:xx IR:10 WE:0 ALU:02 S:ff A:40 X:01 Y:02 P:---I-- T:0
  44 - THI  THP1 PC:01fd05 AB:01fd05 W:0 DI:fd DR:10 D3:fd DO:xx IR:fd WE:0 ALU:02 S:ff A:40 X:01 Y:02 P:---I-- T:0
  45 - THI  THP0 PC:01fd06 AB:01fd06 W:0 DI:01 DR:fd D3:10 DO:xx IR:01 WE:0 ALU:02 S:ff A:40 X:01 Y:02 P:---I-- T:0
  46 - THI  SYNC PC:01fd07 AB:01fd07 W:0 DI:88 DR:fd D3:10 DO:xx IR:88 WE:0 ALU:02 S:ff A:40 X:01 Y:02 P:---I-- T:0
  47 - DEY  SYNC PC:01fd08 AB:01fd08 W:0 DI:d0 DR:fd D3:10 DO:xx IR:d0 WE:0 ALU:01 S:ff A:40 X:01 Y:02 P:---I-- T:0
  48 - BNE  BRA0 PC:01fd09 AB:01fd02 W:0 DI:f9 DR:fd D3:10 DO:xx IR:f9 WE:0 ALU:01 S:ff A:40 X:01 Y:01 P:---I-- T:0
  49 - BNE  SYNC PC:01fd03 AB:01fd03 W:0 DI:33 DR:fd D3:10 DO:xx IR:33 WE:0 ALU:01 S:ff A:40 X:01 Y:01 P:---I-- T:0
  50 - THI  THP2 PC:01fd04 AB:01fd04 W:0 DI:10 DR:fd D3:10 DO:xx IR:10 WE:0 ALU:01 S:ff A:40 X:01 Y:01 P:---I-- T:0
  51 - THI  THP1 PC:01fd05 AB:01fd05 W:0 DI:fd DR:fd D3:10 DO:xx IR:fd WE:0 ALU:01 S:ff A:40 X:01 Y:01 P:---I-- T:0
  52 - THI  THP0 PC:01fd06 AB:01fd06 W:0 DI:01 DR:fd D3:10 DO:xx IR:01 WE:0 ALU:01 S:ff A:40 X:01 Y:01 P:---I-- T:0
  53 - THI  SYNC PC:01fd07 AB:01fd07 W:0 DI:88 DR:fd D3:10 DO:xx IR:88 WE:0 ALU:01 S:ff A:40 X:01 Y:01 P:---I-- T:0
  54 - DEY  SYNC PC:01fd08 AB:01fd08 W:0 DI:d0 DR:fd D3:10 DO:xx IR:d0 WE:0 ALU:00 S:ff A:40 X:01 Y:01 P:---I-- T:0
  55 - BNE  BRA0 PC:01fd09 AB:01fd09 W:0 DI:f9 DR:fd D3:10 DO:xx IR:f9 WE:0 ALU:00 S:ff A:40 X:01 Y:00 P:---IZ- T:0
  56 - BNE  SYNC PC:01fd0a AB:01fd0a W:0 DI:a0 DR:fd D3:10 DO:xx IR:a0 WE:0 ALU:00 S:ff A:40 X:01 Y:00 P:---IZ- T:0
  57 - LDY  IMM0 PC:01fd0b AB:01fd0b W:0 DI:07 DR:07 D3:fd DO:xx IR:07 WE:0 ALU:fd S:ff A:40 X:01 Y:00 P:---IZ- T:0
  58 - LDY  SYNC PC:01fd0c AB:01fd0c W:0 DI:48 DR:07 D3:fd DO:xx IR:48 WE:0 ALU:07 S:ff A:40 X:01 Y:00 P:---IZ- T:0
  59 - PHA  PHA0 PC:01fd0d AB:0001ff W:0 DI:da DR:07 D3:fd DO:40 IR:da WE:1 ALU:40 S:ff A:40 X:01 Y:07 P:---I-- T:0
  60 - PHA  SYNC PC:01fd0d AB:01fd0d W:0 DI:ff DR:07 D3:fd DO:40 IR:da WE:0 ALU:40 S:fe A:40 X:01 Y:07 P:---I-- T:0
  61 - PHX  PHA0 PC:01fd0e AB:0001fe W:0 DI:5a DR:07 D3:fd DO:01 IR:5a WE:1 ALU:01 S:fe A:40 X:01 Y:07 P:---I-- T:0
  62 - PHX  SYNC PC:01fd0e AB:01fd0e W:0 DI:ff DR:07 D3:fd DO:01 IR:5a WE:0 ALU:01 S:fd A:40 X:01 Y:07 P:---I-- T:0
  63 - PHY  PHA0 PC:01fd0f AB:0001fd W:0 DI:03 DR:07 D3:fd DO:07 IR:03 WE:1 ALU:07 S:fd A:40 X:01 Y:07 P:---I-- T:0
  64 - PHY  SYNC PC:01fd0f AB:01fd0f W:0 DI:ff DR:07 D3:fd DO:07 IR:03 WE:0 ALU:07 S:fc A:40 X:01 Y:07 P:---I-- T:0
  65 - THR  THRD PC:01fd10 AB:01fd10 W:0 DI:db DR:07 D3:fd DO:07 IR:db WE:0 ALU:07 S:ff A:47 X:08 Y:10 P:---I-- T:7
  66 - THR  SYNC PC:01fd10 AB:01fd10 W:0 DI:23 DR:07 D3:fd DO:07 IR:f3 WE:0 ALU:07 S:ff A:47 X:08 Y:10 P:---I-- T:7
  67 - _T_  SYNC PC:01fd11 AB:01fd11 W:0 DI:23 DR:07 D3:fd DO:07 IR:23 WE:0 ALU:07 S:ff A:47 X:08 Y:10 P:---I-- T:7
  68 - THY  SYNC PC:01fd12 AB:01fd12 W:0 DI:5a DR:07 D3:fd DO:07 IR:5a WE:0 ALU:07 S:ff A:47 X:08 Y:10 P:---I-- T:7
  69 - PHY  PHA0 PC:01fd13 AB:0008ff W:0 DI:88 DR:07 D3:fd DO:07 IR:88 WE:1 ALU:07 S:ff A:47 X:08 Y:07 P:---I-- T:7
  70 - PHY  SYNC PC:01fd13 AB:01fd13 W:0 DI:ff DR:07 D3:fd DO:07 IR:88 WE:0 ALU:07 S:fe A:47 X:08 Y:07 P:---I-- T:7
  71 - DEY  SYNC PC:01fd14 AB:01fd14 W:0 DI:03 DR:07 D3:fd DO:07 IR:03 WE:0 ALU:06 S:fe A:47 X:08 Y:07 P:---I-- T:7
  72 - THR  THRD PC:01fd15 AB:01fd15 W:0 DI:ea DR:07 D3:fd DO:07 IR:ea WE:0 ALU:06 S:fe A:47 X:08 Y:06 P:---I-- T:7
  73 - THR  SYNC PC:01fd10 AB:01fd10 W:0 DI:ea DR:07 D3:fd DO:07 IR:f3 WE:0 ALU:06 S:ff A:46 X:07 Y:0e P:---I-- T:6
  74 - _T_  SYNC PC:01fd11 AB:01fd11 W:0 DI:23 DR:07 D3:fd DO:07 IR:23 WE:0 ALU:06 S:ff A:46 X:07 Y:0e P:---I-- T:6
  75 - THY  SYNC PC:01fd12 AB:01fd12 W:0 DI:5a DR:07 D3:fd DO:07 IR:5a WE:0 ALU:06 S:ff A:46 X:07 Y:0e P:---I-- T:6
  76 - PHY  PHA0 PC:01fd13 AB:0007ff W:0 DI:88 DR:07 D3:fd DO:06 IR:88 WE:1 ALU:06 S:ff A:46 X:07 Y:06 P:---I-- T:6
  77 - PHY  SYNC PC:01fd13 AB:01fd13 W:0 DI:ff DR:07 D3:fd DO:06 IR:88 WE:0 ALU:06 S:fe A:46 X:07 Y:06 P:---I-- T:6
  78 - DEY  SYNC PC:01fd14 AB:01fd14 W:0 DI:03 DR:07 D3:fd DO:06 IR:03 WE:0 ALU:05 S:fe A:46 X:07 Y:06 P:---I-- T:6
  79 - THR  THRD PC:01fd10 AB:01fd15 W:0 DI:ea DR:07 D3:fd DO:06 IR:ea WE:0 ALU:05 S:ff A:45 X:06 Y:0c P:---I-- T:5
  80 - THR  SYNC PC:01fd10 AB:01fd10 W:0 DI:ea DR:07 D3:fd DO:06 IR:f3 WE:0 ALU:05 S:ff A:45 X:06 Y:0c P:---I-- T:5
  81 - _T_  SYNC PC:01fd11 AB:01fd11 W:0 DI:23 DR:07 D3:fd DO:06 IR:23 WE:0 ALU:05 S:ff A:45 X:06 Y:0c P:---I-- T:5
  82 - THY  SYNC PC:01fd12 AB:01fd12 W:0 DI:5a DR:07 D3:fd DO:06 IR:5a WE:0 ALU:05 S:ff A:45 X:06 Y:0c P:---I-- T:5
  83 - PHY  PHA0 PC:01fd13 AB:0006ff W:0 DI:88 DR:07 D3:fd DO:05 IR:88 WE:1 ALU:05 S:ff A:45 X:06 Y:05 P:---I-- T:5
  84 - PHY  SYNC PC:01fd13 AB:01fd13 W:0 DI:ff DR:07 D3:fd DO:05 IR:88 WE:0 ALU:05 S:fe A:45 X:06 Y:05 P:---I-- T:5
  85 - DEY  SYNC PC:01fd14 AB:01fd14 W:0 DI:03 DR:07 D3:fd DO:05 IR:03 WE:0 ALU:04 S:fe A:45 X:06 Y:05 P:---I-- T:5
  86 - THR  THRD PC:01fd15 AB:01fd15 W:0 DI:ea DR:07 D3:fd DO:05 IR:ea WE:0 ALU:04 S:fe A:45 X:06 Y:04 P:---I-- T:5
  87 - THR  SYNC PC:01fd10 AB:01fd10 W:0 DI:ea DR:07 D3:fd DO:05 IR:f3 WE:0 ALU:04 S:ff A:44 X:05 Y:0a P:---I-- T:4
  88 - _T_  SYNC PC:01fd11 AB:01fd11 W:0 DI:23 DR:07 D3:fd DO:05 IR:23 WE:0 ALU:04 S:ff A:44 X:05 Y:0a P:---I-- T:4
  89 - THY  SYNC PC:01fd12 AB:01fd12 W:0 DI:5a DR:07 D3:fd DO:05 IR:5a WE:0 ALU:04 S:ff A:44 X:05 Y:0a P:---I-- T:4
  90 - PHY  PHA0 PC:01fd13 AB:0005ff W:0 DI:88 DR:07 D3:fd DO:04 IR:88 WE:1 ALU:04 S:ff A:44 X:05 Y:04 P:---I-- T:4
  91 - PHY  SYNC PC:01fd13 AB:01fd13 W:0 DI:ff DR:07 D3:fd DO:04 IR:88 WE:0 ALU:04 S:fe A:44 X:05 Y:04 P:---I-- T:4
  92 - DEY  SYNC PC:01fd14 AB:01fd14 W:0 DI:03 DR:07 D3:fd DO:04 IR:03 WE:0 ALU:03 S:fe A:44 X:05 Y:04 P:---I-- T:4
  93 - THR  THRD PC:01fd10 AB:01fd15 W:0 DI:ea DR:07 D3:fd DO:04 IR:ea WE:0 ALU:03 S:ff A:43 X:04 Y:08 P:---I-- T:3
  94 - THR  SYNC PC:01fd10 AB:01fd10 W:0 DI:ea DR:07 D3:fd DO:04 IR:f3 WE:0 ALU:03 S:ff A:43 X:04 Y:08 P:---I-- T:3
  95 - _T_  SYNC PC:01fd11 AB:01fd11 W:0 DI:23 DR:07 D3:fd DO:04 IR:23 WE:0 ALU:03 S:ff A:43 X:04 Y:08 P:---I-- T:3
  96 - THY  SYNC PC:01fd12 AB:01fd12 W:0 DI:5a DR:07 D3:fd DO:04 IR:5a WE:0 ALU:03 S:ff A:43 X:04 Y:08 P:---I-- T:3
  97 - PHY  PHA0 PC:01fd13 AB:0004ff W:0 DI:88 DR:07 D3:fd DO:03 IR:88 WE:1 ALU:03 S:ff A:43 X:04 Y:03 P:---I-- T:3
  98 - PHY  SYNC PC:01fd13 AB:01fd13 W:0 DI:ff DR:07 D3:fd DO:03 IR:88 WE:0 ALU:03 S:fe A:43 X:04 Y:03 P:---I-- T:3
  99 - DEY  SYNC PC:01fd14 AB:01fd14 W:0 DI:03 DR:07 D3:fd DO:03 IR:03 WE:0 ALU:02 S:fe A:43 X:04 Y:03 P:---I-- T:3
 100 - THR  THRD PC:01fd15 AB:01fd15 W:0 DI:ea DR:07 D3:fd DO:03 IR:ea WE:0 ALU:02 S:fe A:43 X:04 Y:02 P:---I-- T:3
 101 - THR  SYNC PC:01fd10 AB:01fd10 W:0 DI:ea DR:07 D3:fd DO:03 IR:f3 WE:0 ALU:02 S:ff A:42 X:03 Y:06 P:---I-- T:2
 102 - _T_  SYNC PC:01fd11 AB:01fd11 W:0 DI:23 DR:07 D3:fd DO:03 IR:23 WE:0 ALU:02 S:ff A:42 X:03 Y:06 P:---I-- T:2
 103 - THY  SYNC PC:01fd12 AB:01fd12 W:0 DI:5a DR:07 D3:fd DO:03 IR:5a WE:0 ALU:02 S:ff A:42 X:03 Y:06 P:---I-- T:2
 104 - PHY  PHA0 PC:01fd13 AB:0003ff W:0 DI:88 DR:07 D3:fd DO:02 IR:88 WE:1 ALU:02 S:ff A:42 X:03 Y:02 P:---I-- T:2
 105 - PHY  SYNC PC:01fd13 AB:01fd13 W:0 DI:ff DR:07 D3:fd DO:02 IR:88 WE:0 ALU:02 S:fe A:42 X:03 Y:02 P:---I-- T:2
 106 - DEY  SYNC PC:01fd14 AB:01fd14 W:0 DI:03 DR:07 D3:fd DO:02 IR:03 WE:0 ALU:01 S:fe A:42 X:03 Y:02 P:---I-- T:2
 107 - THR  THRD PC:01fd10 AB:01fd15 W:0 DI:ea DR:07 D3:fd DO:02 IR:ea WE:0 ALU:01 S:ff A:41 X:02 Y:04 P:---I-- T:1
 108 - THR  SYNC PC:01fd10 AB:01fd10 W:0 DI:ea DR:07 D3:fd DO:02 IR:f3 WE:0 ALU:01 S:ff A:41 X:02 Y:04 P:---I-- T:1
 109 - _T_  SYNC PC:01fd11 AB:01fd11 W:0 DI:23 DR:07 D3:fd DO:02 IR:23 WE:0 ALU:01 S:ff A:41 X:02 Y:04 P:---I-- T:1
 110 - THY  SYNC PC:01fd12 AB:01fd12 W:0 DI:5a DR:07 D3:fd DO:02 IR:5a WE:0 ALU:01 S:ff A:41 X:02 Y:04 P:---I-- T:1
 111 - PHY  PHA0 PC:01fd13 AB:0002ff W:0 DI:88 DR:07 D3:fd DO:01 IR:88 WE:1 ALU:01 S:ff A:41 X:02 Y:01 P:---I-- T:1
 112 - PHY  SYNC PC:01fd13 AB:01fd13 W:0 DI:ff DR:07 D3:fd DO:01 IR:88 WE:0 ALU:01 S:fe A:41 X:02 Y:01 P:---I-- T:1
 113 - DEY  SYNC PC:01fd14 AB:01fd14 W:0 DI:03 DR:07 D3:fd DO:01 IR:03 WE:0 ALU:00 S:fe A:41 X:02 Y:01 P:---I-- T:1
 114 - THR  THRD PC:01fd15 AB:01fd15 W:0 DI:ea DR:07 D3:fd DO:01 IR:ea WE:0 ALU:00 S:fe A:41 X:02 Y:00 P:---IZ- T:1
 115 - THR  SYNC PC:01fd0f AB:01fd0f W:0 DI:ea DR:07 D3:fd DO:01 IR:f3 WE:0 ALU:00 S:fc A:40 X:01 Y:07 P:---I-- T:0
 116 - _T_  SYNC PC:01fd10 AB:01fd10 W:0 DI:db DR:07 D3:fd DO:01 IR:db WE:0 ALU:00 S:fc A:40 X:01 Y:07 P:---I-- T:0
 117 - STP  SYNC PC:01fd11 AB:01fd11 W:0 DI:23 DR:07 D3:fd DO:01 IR:23 WE:0 ALU:00 S:fc A:40 X:01 Y:07 P:---I-- T:0

THR0 - ---  ---- PC:01fd11 AB:000000 W:0 DI:00 DR:00 D3:00 DO:00 IR:00 WE:0 ALU:00 S:fc A:40 X:01 Y:07 P:---I--
THR1 - ---  ---- PC:01fd14 AB:000000 W:0 DI:00 DR:00 D3:00 DO:00 IR:00 WE:0 ALU:00 S:fe A:41 X:02 Y:00 P:---IZ-
THR2 - ---  ---- PC:01fd14 AB:000000 W:0 DI:00 DR:00 D3:00 DO:00 IR:00 WE:0 ALU:00 S:fe A:42 X:03 Y:01 P:---I--
THR3 - ---  ---- PC:01fd14 AB:000000 W:0 DI:00 DR:00 D3:00 DO:00 IR:00 WE:0 ALU:00 S:fe A:43 X:04 Y:02 P:---I--
THR4 - ---  ---- PC:01fd14 AB:000000 W:0 DI:00 DR:00 D3:00 DO:00 IR:00 WE:0 ALU:00 S:fe A:44 X:05 Y:03 P:---I--
THR5 - ---  ---- PC:01fd14 AB:000000 W:0 DI:00 DR:00 D3:00 DO:00 IR:00 WE:0 ALU:00 S:fe A:45 X:06 Y:04 P:---I--
THR6 - ---  ---- PC:01fd14 AB:000000 W:0 DI:00 DR:00 D3:00 DO:00 IR:00 WE:0 ALU:00 S:fe A:46 X:07 Y:05 P:---I--
THR7 - ---  ---- PC:01fd14 AB:000000 W:0 DI:00 DR:00 D3:00 DO:00 IR:00 WE:0 ALU:00 S:fe A:47 X:08 Y:06 P:---I--

000001f0: ff ff ff ff ff ff ff ff ff ff ff ff ff 07 01 40
000002f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 01
000003f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 02
000004f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 03
000005f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 04
000006f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 05
000007f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 06
000008f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 07



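The T: field at the right of each trace line shows which thread is running.
The THR0 to THR7 lines below the trace show the state of each thread at the end of the run.
Last is a dump of the memory at the top of each thread's stack. For threads 1 to 7 it shows the thread number that was pushed there; thread 0's page holds the A, X and Y values pushed by PHA, PHX and PHY.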
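
The PHA0 cycles in the trace suggest how each thread's stack is placed: thread 0 pushes to 0001ff while thread 7 pushes to 0008ff, so the stack page appears to be the thread number plus one. A minimal Python sketch of that mapping follows. The function name `stack_address` and the formula are inferred from the trace alone, not taken from the HDL, so treat both as assumptions.

```python
def stack_address(thread: int, s: int) -> int:
    """Return the 24-bit address a push would use, as inferred from the trace.

    Assumption: stack page = thread + 1, low byte = stack pointer S.
    """
    if not 0 <= thread <= 7:
        raise ValueError("thread index must be 0..7")
    if not 0 <= s <= 0xFF:
        raise ValueError("S must fit in 8 bits")
    return ((thread + 1) << 8) | s


# (T, S, AB) values copied from PHA0 cycles in the trace above.
OBSERVED = [
    (0, 0xFF, 0x0001FF),  # PHA, thread 0
    (0, 0xFE, 0x0001FE),  # PHX, thread 0
    (0, 0xFD, 0x0001FD),  # PHY, thread 0
    (7, 0xFF, 0x0008FF),  # PHY, thread 7
    (6, 0xFF, 0x0007FF),  # PHY, thread 6
    (1, 0xFF, 0x0002FF),  # PHY, thread 1
]

# Collect any observed cycle that the inferred formula fails to reproduce.
mismatches = [(t, s, ab) for t, s, ab in OBSERVED if stack_address(t, s) != ab]
print("mismatches:", mismatches)
```

The same mapping lines up with the memory dump, where the rows at 000001f0 through 000008f0 hold the pushed bytes.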


PostPosted: Thu Jul 14, 2022 5:24 pm 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
65LUN02 wrote:
Ultimately, the additional number of gates in a real implementation is driven far more by the number of copies of the registers. Each set of A/X/Y/S/P/PC is 8+8+8+7+24 = 55 bits, plus 2-3 more bits for a T register to hold the current thread index.

Eight sets of registers require roughly the same number of transistors as the original NMOS 6502. I've not seen an accurate count for the CMOS 65C02, but I've seen it estimated at 5,000 transistors. So eight sets of registers amount to roughly 50% additional transistors.


In an FPGA, the register file is efficiently implemented using 1 LUT6 as a 64x1-bit memory. For 8-bit wide registers you need 8 LUT6s, but then you get 64 registers, no matter how many of them you actually use. Extending my original register file to 8 threads would take no additional resources, for example.
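
To make the sizing argument concrete, here is a small Python sketch of a 64-entry register file addressed by thread and register number. The 64x1-bit LUT6 figure comes from the post; the choice of eight slots per thread and the names `luts_needed` and `regfile_index` are hypothetical, picked only so that 8 threads fill the 64 entries exactly.

```python
LUT6_DEPTH = 64  # one LUT6 used as a 64x1-bit memory (figure from the post)


def luts_needed(width_bits: int = 8) -> int:
    """One LUT6 per bit of register width, however many entries are used."""
    return width_bits


def regfile_index(thread: int, reg: int, slots_per_thread: int = 8) -> int:
    """Flatten (thread, reg) into one register-file address.

    The layout (thread in the high bits, register in the low bits) is a
    hypothetical choice for illustration, not the layout of the real design.
    """
    if thread < 0:
        raise ValueError("thread index must be non-negative")
    if not 0 <= reg < slots_per_thread:
        raise ValueError("register slot out of range")
    index = thread * slots_per_thread + reg
    if index >= LUT6_DEPTH:
        raise ValueError("does not fit in one 64-entry LUT memory")
    return index


# 8 threads x 8 slots use every one of the 64 entries exactly once.
used = sorted(regfile_index(t, r) for t in range(8) for r in range(8))
print(len(used), "entries,", luts_needed(), "LUT6s for 8-bit registers")
```

Under these assumptions the LUT count depends only on the register width, which is why adding threads up to the 64-entry limit costs nothing extra.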

Edit: I'm happy to see that my original design goal of making something relatively easy to understand/modify has worked out. The new style decoder makes it a lot easier to expand the instruction set compared to my original 6502 project with all the overlapping "don't care" bits.


PostPosted: Thu Jul 14, 2022 5:47 pm 

Joined: Wed Jun 29, 2022 2:15 am
Posts: 44
Have you dug into the Visual 6502 to see how closely your control bits match up with the actual chip? I suspect your logic is very close to theirs. Same with the corresponding states, although I've never seen a definitive guide of 6502 states to compare.

You mention in your README that you removed the extra cycles. Do you know why the 65(C)02 has those extra states? Were the designers being conservative about what could be done in a state, given they had no automated tools to measure timings? Or did they find some common logic they could reuse, trading that reuse for an extra cycle?

For the 65C24T8, the biggest time suck was trying to get the thread switch to happen in two cycles instead of three. I'd bet $1 that you could squeeze out that extra state by twiddling just the right control lines. My brute-force approach was to give up and add a bit of logic to decrement the PC in the new THRD state, plus the _T_ noop in the new TYNC state, so that the right opcode is run when returning to a thread and the opcode from the new thread isn't run twice.

