Would a 32 bit evolution of 65816 stay accumulator based?

JustClaire · Post by **JustClaire** » Wed Jul 29, 2020 12:26 am

Hello,

I have been wondering: if there were a 32-bit evolution of the 65816, would it stay accumulator-oriented (one 32-bit accumulator that can be subdivided into smaller ones, as on the '816), or would it have multiple full-size general-purpose registers (for example, two accumulators)?

What are your thoughts on the matter?

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jul 29, 2020 12:39 am

JustClaire wrote:

Hello,

I have been wondering: if there were a 32-bit evolution of the 65816, would it stay accumulator-oriented (one 32-bit accumulator that can be subdivided into smaller ones, as on the '816), or would it have multiple full-size general-purpose registers (for example, two accumulators)?

What are your thoughts on the matter?

Some years ago, WDC published a preliminary data sheet for a 65C832, which had a 32-bit accumulator that could be reduced to 16 or 8 bits. The '832 never saw the light of day, as there really was no business case to support going forward with it—unlike the 65C816, there wasn't an Apple ready to place a volume order for product.

Attachment: w65c832s.pdf (65C832 Preliminary Data Sheet, 3.41 MiB)

GARTHWILSON · Post by **GARTHWILSON** » Wed Jul 29, 2020 1:25 am

I think you could say that if it had additional general-purpose registers, it would no longer be a 65-family processor.

Others who know more about HLL compilers may correct me, but it is my understanding that the greatest reason that lots of registers were put in processors was to make it easier to write compilers for them. It does not necessarily improve performance though. The 1802 processor with all its registers was a real dog performance-wise, the 65816 compared favorably to the 68K in the Sieve benchmark, and Sophie Wilson, chief architect of the ARM processor, said, "an 8MHz 32016 was completely trounced in performance terms by a 4MHz 6502." (The 32016 was National's 32-bit processor, having 15 registers, including 8 general-purpose 32-bit registers.) Jack Crenshaw, an embedded-systems engineer who wrote regularly in Embedded Systems Programming magazine, said in the 9/98 issue that he still couldn't figure out why, in BASIC benchmark after benchmark, the 6502 could outperform the Z80, which had more and bigger registers, a seemingly more powerful instruction set, and ran at higher clock rates. (The 6502's zero page and improved indexed and indirect addressing modes no doubt helped.)

I hardly find the lack of registers to be a limitation. Ed observed, "With 6502, I suspect more than one beginner has wondered why they can't do arithmetic or logic operations on X or Y, or struggled to remember which addressing modes use which of the two. And then the intermediate 6502 programmer will be loading and saving X and Y while the expert always seems to have the right values already in place." And indeed we regularly see beginners making things more complicated than they needed to be, especially if they came from a background of more registers or a similar situation.

That said, I would still like to have a second X register, to use one as an index to virtual stacks while leaving the first one available. We've had several topics here about additional instructions we would like (which a 32-bit word easily accommodates) and about our visions of a 32-bit 65-family processor. Ed had a topic index post at viewtopic.php?f=1&t=4216 .

dmsc · Post by **dmsc** » Wed Jul 29, 2020 2:50 am

Hi!

GARTHWILSON wrote:

I think you could say that if it had additional general-purpose registers, it would no longer be a 65-family processor.

Others who know more about HLL compilers may correct me, but it is my understanding that the greatest reason that lots of registers were put in processors was to make it easier to write compilers for them. It does not necessarily improve performance though.

No, compiler simplicity is not the reason, as register allocation is one of the difficult parts of a compiler. The main reason is that memory is a lot slower than registers, so you can't build a fast processor without enough registers.

The 6502 was fast in its day because it has a very fast memory bus (one memory access each cycle) and has its zero page as a "slow register area".

But that is also the reason a 6502-based design can't be made faster. For example, look at an instruction like "LDA (12), X". That instruction reads from one register (X), reads from three memory locations (12, 13 and the address calculated from those) and writes to the accumulator. That means that you can't execute that instruction in fewer than 3 cycles (one for each memory access), even discounting the reading of the instruction word.

Now, if you had a 16 bit register "H", you could write " LDA (H),X ", performing the same operation but with only one memory access. In a high performance implementation, you could execute that in one cycle.
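To make the access counting concrete, here is a minimal Python sketch. It is only an illustration under assumptions: `CountingMemory`, `load_via_zero_page` and `load_via_register` are hypothetical names, instruction fetches are ignored (as in the argument above), and the pointer register `H` is the imagined one from this post, not a real 6502 feature.

```python
class CountingMemory:
    """Byte-wide memory that counts every read (hypothetical helper)."""

    def __init__(self, size=65536):
        self.cells = bytearray(size)
        self.reads = 0

    def read(self, addr):
        self.reads += 1
        return self.cells[addr & 0xFFFF]


def load_via_zero_page(mem, zp, index):
    """Pointer stored in two zero-page bytes: three data reads in total."""
    lo = mem.read(zp)
    hi = mem.read((zp + 1) & 0xFF)
    return mem.read(((hi << 8) | lo) + index)


def load_via_register(mem, h, index):
    """Pointer already held in an assumed 16-bit register H: one data read."""
    return mem.read(h + index)


if __name__ == "__main__":
    mem = CountingMemory()
    mem.cells[12], mem.cells[13] = 0x00, 0x20   # pointer $2000 at locations 12/13
    mem.cells[0x2005] = 0x42
    print(load_via_zero_page(mem, 12, 5), mem.reads)
    mem.reads = 0
    print(load_via_register(mem, 0x2000, 5), mem.reads)
```

With one memory access per cycle, the read counts are a lower bound on the cycles each form needs for its data accesses.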

That is the reason that the ARM1 was faster than most competing CPUs: the architecture has a lot of registers and the memory access was fast.

The main problem with many registers is that you need to save the registers on interrupts (making interrupts slow) and that if you want to have multiple ports in the register file (i.e., being capable of reading and writing multiple registers at the same time) the register file consumes a lot of power in the CPU. The first was solved in the ARM by having two register files that are swapped on interrupts, so you don't need to save the registers. The second was solved by having only simple instructions that read at most two registers in one cycle and write one, but even so the register file is more than 20% of the chip area: http://www.righto.com/2015/12/reverse-e ... or-of.html

Have Fun!

GARTHWILSON · Post by **GARTHWILSON** » Wed Jul 29, 2020 3:23 am

Yes; but I would also argue that these don't necessarily give as much benefit as it would initially appear, as I showed above. SRAM is available in much faster speeds today than anything else you'll put on the bus. 10ns is run-of-the-mill today, and I've seen down to 6ns in outboard SRAM. Actually, that has been the case for many years. If it were on the same die with the processor, it would be even much faster; and in a modern deep-submicron silicon process, even faster. Maybe 1ns? Maybe measured in hundreds of picoseconds?

I think taking advantage of what you're saying requires much more complex instruction decoding, reducing the processor's maximum clock speed, unless you also get into the complexities of pipelines and having several instructions simultaneously in the process of getting decoded. The next step in peeling this onion is trying to do branch prediction to try to minimize the penalties of having to flush the pipelines, which becomes another reason why the complex processors are so difficult to write assembly language for, and compilers kind of created their own need. Obviously we do have very powerful processors today, so it is possible; but at what cost? 10,000-20,000-transistor processors have given way to processors with not just millions of transistors, but billions.

Also, I'm constantly doing nested interrupt service routines (particularly NMI interrupting the servicing of an IRQ), so having two register sets is not enough to avoid saving registers for interrupt service.

As a caveat, I will say I like things simple, in all areas of life. I don't like irrelevant graphics in web pages, I don't like home automation, I don't use a smartphone (for several reasons), I don't like modern cars with all their gizmos, etc. My house is even too big, and now that the kids are grown and gone, I would like to scale down, waaaaaay down. So you can partly see where I'm coming from.

BillG · Post by **BillG** » Wed Jul 29, 2020 4:07 am

GARTHWILSON wrote:

Others who know more about HLL compilers may correct me, but it is my understanding that the greatest reason that lots of registers were put in processors was to make it easier to write compilers for them. It does not necessarily improve performance though. The 1802 processor with all its registers was a real dog performance-wise, the 65816 compared favorably to the 68K in the Sieve benchmark, and Sophie Wilson, chief architect of the ARM processor, said, "an 8MHz 32016 was completely trounced in performance terms by a 4MHz 6502." (The 32016 was National's 32-bit processor, having 15 registers, including 8 general-purpose 32-bit registers.) Jack Crenshaw, an embedded-systems engineer who wrote regularly in Embedded Systems Programming magazine, said in the 9/98 issue that he still couldn't figure out why, in BASIC benchmark after benchmark, the 6502 could outperform the Z80, which had more and bigger registers, a seemingly more powerful instruction set, and ran at higher clock rates. (The 6502's zero page and improved indexed and indirect addressing modes no doubt helped.)

From https://en.wikipedia.org/wiki/RCA_1802# ... ion_timing

Quote:

Clock cycle efficiency is poor in comparison to most 8-bit microprocessors. Eight clock cycles makes up one machine cycle. Most instructions take two machine cycles (16 clock cycles) to execute; the remaining instructions take three machine cycles (24 clock cycles). By comparison, the MOS Technology 6502 takes two to seven clock cycles to execute an instruction, and the Intel 8080 takes four to 18 clock cycles.

Like the 1802, the 68K also suffered from requiring more clock cycles per instruction than many of its competitors and only somewhat making up for it with more and wider registers. Its major advantage at the time was a large address space.
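The 1802 figures quoted above can be turned into instruction times with a few lines of Python. This is a rough sketch only: the function names are made up for illustration, and the 1 MHz clock used for both chips is an assumption chosen to make the comparison easy, not a claim about any particular system.

```python
CLOCKS_PER_MACHINE_CYCLE_1802 = 8   # from the quoted Wikipedia text


def clocks_1802(machine_cycles):
    """Clock cycles for an 1802 instruction of 2 or 3 machine cycles."""
    return machine_cycles * CLOCKS_PER_MACHINE_CYCLE_1802


def instruction_time_us(clocks, clock_mhz):
    """Execution time in microseconds at the given clock rate."""
    return clocks / clock_mhz


if __name__ == "__main__":
    assumed_mhz = 1.0   # assumption: same clock for both chips
    for cycles in (2, 3):
        print("1802:", instruction_time_us(clocks_1802(cycles), assumed_mhz), "us")
    for clocks in (2, 7):   # quoted 6502 range: two to seven clocks
        print("6502:", instruction_time_us(clocks, assumed_mhz), "us")
```

At equal clock rates, the shortest 1802 instruction (16 clocks) takes longer than the longest 6502 instruction in the quoted range (7 clocks).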

The 8080/Z80 may have more registers, but data has to be moved into the single accumulator to perform arithmetic operations. And though all three register pairs can be used to access memory, there is no way to specify a displacement to the address in the pair. Accessing fields within a data structure meant having to repeatedly modify the address in a register. The Z80 came with two index registers which have displacement capability, but they are one clock cycle slower and it is somewhat inefficient to get an address into or out of them.

The two almost equal accumulators in the 680x are very pleasant to use. The 6800 suffers from having only one index register.

I have programmed many processor architectures and have worked on toy compilers for some of them.

Too few registers is a hindrance, but there is a point of diminishing returns. The TI 9900 and the 68K each offer sixteen registers; the AVR has thirty-two, though they are byte-sized.

GARTHWILSON wrote:

That said, I would still like to have a second X register, to use one as an index to virtual stacks while leaving the first one available.

I would like an (absolute),Y addressing mode.

dmsc · Post by **dmsc** » Wed Jul 29, 2020 4:10 am

Hi!

GARTHWILSON wrote:

Yes; but I would also argue that these don't necessarily give as much benefit as it would initially appear, as I showed above. SRAM is available in much faster speeds today than anything else you'll put on the bus. 10ns is run-of-the-mill today, and I've seen down to 6ns in outboard SRAM. Actually, that has been the case for many years. If it were on the same die with the processor, it would be even much faster; and in a modern deep-submicron silicon process, even faster. Maybe 1ns? Maybe measured in hundreds of picoseconds?

Problem is, you are still limited to one access per clock cycle. In contrast, register files are multi-ported, typically allowing two reads and one write per cycle.

Quote:

I think taking advantage of what you're saying requires much more complex instruction decoding, reducing the processor's maximum clock speed, unless you also get into the complexities of pipelines and having several instructions simultaneously in the process of getting decoded. The next step in peeling this onion is trying to do branch prediction to try to minimize the penalties of having to flush the pipelines, which becomes another reason why the complex processors are so difficult to write assembly language for, and compilers kind of created their own need. Obviously we do have very powerful processors today, so it is possible; but at what cost? 10,000-20,000-transistor processors have given way to processors with not just millions of transistors, but billions.

No, actually the 6502 is not simpler than newer minimal RISC CPUs.

You can compare FPGA implementations: I have a cheap "upduino" board and built my own 6502-based computer. Using Arlet's Verilog 6502, I can get the 6502 core up to 16MHz, using about 900 LUTs for the CPU. On the other hand, a 32-bit RISC-V core uses about 2000 LUTs and clocks at about 20MHz (see https://github.com/grahamedgecombe/icicle ).

And the RISC-V core executes up to one instruction per cycle, instead of the multiple cycles per instruction in the 6502, so it is *much* faster. Two to three times the number of LUTs for going from 8 bits to 32 bits and from 5 to 16 registers is not that much.
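As a back-of-the-envelope check on those numbers, here is a small Python sketch. The clock rates and LUT counts are the ones given in this post; the average of 4 cycles per 6502 instruction is an assumption for illustration, the RISC-V figure is a best case ("up to" one instruction per cycle), and the function names are hypothetical. It also ignores that a 32-bit instruction usually does more work than an 8-bit one.

```python
def peak_mips(clock_mhz, cycles_per_instruction):
    """Millions of instructions per second at a given average CPI."""
    return clock_mhz / cycles_per_instruction


def mips_per_kilo_lut(mips, luts):
    """Throughput normalised by FPGA area (per 1000 LUTs)."""
    return mips / (luts / 1000.0)


if __name__ == "__main__":
    assumed_6502_cpi = 4.0   # assumption: rough average, real code varies
    mips_6502 = peak_mips(16.0, assumed_6502_cpi)
    mips_riscv = peak_mips(20.0, 1.0)   # best case, one instruction per cycle
    print("6502  :", mips_6502, "MIPS,", mips_per_kilo_lut(mips_6502, 900), "per kLUT")
    print("RISC-V:", mips_riscv, "MIPS,", mips_per_kilo_lut(mips_riscv, 2000), "per kLUT")
    print("LUT ratio:", 2000 / 900)
```

Under these assumptions the RISC-V core still comes out ahead per LUT, even before accounting for its wider data path.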

The reason the 6502 core can't run faster is that the results from the ALU are fed directly to the memory address bus, so the critical path is pretty long.

Quote:

Also, I'm constantly doing nested interrupt service routines (particularly NMI interrupting the servicing of an IRQ), so having two register sets is not enough to avoid saving registers for interrupt service.

The ARM cores also support two types of interrupts: FIQ is equivalent to the NMI and uses the alternate registers, while IRQ acts like the 6502 interrupts, where you have to push the registers on the stack.

IMHO, ARM assembly is simpler than 6502 assembly: you don't have the "indirect" addressing modes, all registers are similar, and you can use condition codes to avoid jumps.

I really think that the ARM cores (in particular the old integer-only ones) are the true evolution of the 6502's simplicity.

Have Fun!

GARTHWILSON · Post by **GARTHWILSON** » Wed Jul 29, 2020 6:57 am

dmsc wrote:

Problem is, you are still limited to one access per clock cycle. In contrast, register files are multi-ported, typically allowing two reads and one write per cycle.

There has been talk of putting ZP and the hardware stack onboard on their own buses, so you could access one or both of those at the same time as accessing outboard memory. Then a PHA, for example, could take only one cycle, being executed while the next instruction is being read. Various challenges would have to be worked out; and I don't know if there's any solution for the '816, where the ZP (now called "direct page") is not locked to page 0, but can start anywhere in the first 64K, and can overlap the hardware stack area if desired, the stack area also being able to be dozens of KB if desired. Perhaps having two outboard buses and dual-port RAM is the way to do it. I have not seen large dual-port RAMs though.

Quote:

No, actually the 6502 is not simpler than newer minimal RISC CPUs.

You can compare FPGA implementations: I have a cheap "upduino" board and built my own 6502-based computer. Using Arlet's Verilog 6502, I can get the 6502 core up to 16MHz, using about 900 LUTs for the CPU. On the other hand, a 32-bit RISC-V core uses about 2000 LUTs

How many transistors are in a LUT? I'm not sure the number of LUTs is a valid comparison if you weren't limited to programmable logic, because the 6502 uses bidirectional pass transistors that are a lot simpler than the programmable logic required to do a similar job.

Quote:

And the RISC-V core executes up to one instruction per cycle, instead of the multiple cycles per instruction in the 6502, so it is *much* faster. Two to three times the number of LUTs for going from 8 bits to 32 bits and from 5 to 16 registers is not that much.

The meaning of speed comes in somewhat of a logarithmic curve. 30% faster is almost imperceptible. Twice as fast is starting to be significant, but definitely not monumental. Four times is quite significant, but so is the programming efficiency. For example, I believe the C64 and other computers of decades ago could have been much more powerful with the same hardware if modern software methods and tools had been available back then, and certainly with newer hardware attached to the basic 1MHz computer (like the 512KB REU which dramatically sped up GEOS on the C64).

Quote:

IMHO, ARM assembly is simpler than 6502 assembly: you don't have the "indirect" addressing modes, all registers are similar, and you can use condition codes to avoid jumps.

That sounds like the processor in the PIC16 microcontrollers which I have used very extensively at work. There is no indirect addressing mode, only a file-select register (FSR) and the indirect file (INDF) register. Actually, INDF is not a physical register; when you read it or write to it, you're actually reading or writing to the RAM address it points to, which you told it when you wrote to the FSR. Since there's no indexing either, if you want indexing and indirection, it again means extra instructions. It's very inefficient. There's only one FSR, not lots like you can have in the 6502's ZP. It was a big problem, so in the PIC17, they added one more FSR, but still 8-bit, and PIC18 added a third one, and the 18's are 12-bit.
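For readers who have not used a PIC16, the FSR/INDF mechanism can be modelled in a few lines of Python. This is a simplified sketch, not a cycle-accurate model: `FileRegisters` and `read_indexed` are hypothetical names, banking is ignored, and the register addresses (INDF at 0x00, FSR at 0x04) are assumed from the usual mid-range PIC16 layout.

```python
class FileRegisters:
    """Simplified PIC16-style register file with FSR/INDF indirection."""

    INDF = 0x00   # not a physical register: an alias for ram[FSR]
    FSR = 0x04    # file-select register holding the pointer

    def __init__(self, size=128):
        self.ram = [0] * size

    def _target(self, addr):
        # Accesses to INDF are redirected to the address held in FSR.
        if addr == self.INDF:
            return self.ram[self.FSR] % len(self.ram)
        return addr

    def read(self, addr):
        return self.ram[self._target(addr)]

    def write(self, addr, value):
        self.ram[self._target(addr)] = value & 0xFF


def read_indexed(regs, base, index):
    """Emulate table[index]: the sum must be computed and stored in FSR first."""
    regs.write(FileRegisters.FSR, base + index)   # extra instructions on a real PIC
    return regs.read(FileRegisters.INDF)


if __name__ == "__main__":
    regs = FileRegisters()
    regs.write(0x23, 0x5A)
    print(hex(read_indexed(regs, 0x20, 3)))
```

The point of the model is the extra step in `read_indexed`: with a single FSR and no indexing, every indexed or indirect access starts by recomputing and reloading the pointer.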

Their separate address and data buses (Harvard architecture) allow reading both at once; but most instructions still take four clock cycles (which they call one instruction cycle) just like the 6502's average; but since the instruction set is so poor, it takes twice as many instructions to get a job done, and therefore twice as many clock cycles—if indeed the PIC can do it at all. (And BTW, the Harvard thing presents its own set of programming problems that don't exist on the '02.) And to get the op code and data in a single instruction word, they had to sacrifice the upper address bits and divide the RAM into banks and the program ROM into pages, and managing them is an inefficient mess, requiring a lot of extra instructions.

The PICs also don't have a conditional branch instruction, only instructions to conditionally skip an instruction if the tested condition is false (IOW, "If this is true, then don't execute the following instruction" backwards logic), and the instruction you skip is usually a jump, because you can't skip multiple instructions. So a conditional branch takes 8 cycles if the jump is not taken (compared to the two cycles the '02 takes), or 12 cycles if you do jump (compared to the three cycles the '02 takes).
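The cycle counts above can be reproduced with a short Python sketch. The breakdown is my reading of the usual mid-range PIC16 timings (a skip instruction takes one instruction cycle when it does not skip and two when it does, a GOTO takes two), so treat those per-instruction figures as assumptions; the function names are hypothetical.

```python
CLOCKS_PER_INSTRUCTION_CYCLE = 4   # PIC16: four clocks per instruction cycle


def pic16_branch_clocks(taken):
    """Clocks for a skip-plus-GOTO pair emulating a conditional branch."""
    if taken:
        skip_cycles = 1   # test instruction falls through to the GOTO
        goto_cycles = 2   # GOTO executes
    else:
        skip_cycles = 2   # test instruction skips over the GOTO
        goto_cycles = 0
    return (skip_cycles + goto_cycles) * CLOCKS_PER_INSTRUCTION_CYCLE


def mos6502_branch_clocks(taken, page_crossed=False):
    """Clocks for a 6502 relative branch."""
    if not taken:
        return 2
    return 4 if page_crossed else 3


if __name__ == "__main__":
    for taken in (False, True):
        print("taken" if taken else "not taken",
              pic16_branch_clocks(taken), mos6502_branch_clocks(taken))
```

The 6502 figure rises to four clocks when a taken branch crosses a page boundary, which the post's "three cycles" leaves out.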

I know that contrary to the fact that Microchip called the PIC16 a RISC, it is not a good example of a RISC in the sense that it does not have a lot of general-purpose registers, and it does not finish one instruction per clock cycle. My point is that the things people associate with performance don't necessarily add up to performance, or if they do produce a gain in performance, that gain is not necessarily monumental. There's a place for things like the AMD Opteron; but I would also like to see the 65 family taken a lot further while still strongly retaining the 65 flavor.

BigEd · Post by **BigEd** » Wed Jul 29, 2020 8:20 am

One could look at the '832 (proposal) and the original ARM for two ways to extend a 6502 philosophy. For those who've programmed the early ARM at assembly level, and come from a 6502 background, it does feel very familiar and comfortable.

The ARM designers bore a few things in mind:
- complexity is a great cost to carry especially in a small team with a tight deadline
- effectively useful memory bandwidth is the biggest influence on performance
- for fast interrupt response, shadow registers and simple instructions are a good way to go

The clock speed of the CPU is not as important as the rate at which memory accesses can be made: hence a lot of the confusion with the Z80 vs 6502, and indeed the 1802. It's memory cycles which should be counted.
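Counting memory cycles rather than clock cycles can be sketched in Python as below. The clocks-per-access figures are assumptions for illustration: one clock per access for the 6502, three to four clocks (T-states) per access for the Z80, and eight clocks per machine cycle for the 1802 as quoted earlier in the thread. The clock rates are example values, not measurements, and the function name is hypothetical.

```python
def memory_accesses_per_us(clock_mhz, clocks_per_access):
    """Memory accesses the CPU can make per microsecond."""
    return clock_mhz / clocks_per_access


if __name__ == "__main__":
    # (label, assumed clock in MHz, assumed clocks per memory access)
    examples = [
        ("6502 @ 1 MHz", 1.0, 1),
        ("Z80  @ 4 MHz, opcode fetch", 4.0, 4),
        ("Z80  @ 4 MHz, data read/write", 4.0, 3),
        ("1802 @ 2 MHz", 2.0, 8),
    ]
    for label, mhz, clocks in examples:
        print(label, round(memory_accesses_per_us(mhz, clocks), 2), "accesses/us")
```

On these assumptions a 4 MHz Z80 and a 1 MHz 6502 land in the same range of memory accesses per microsecond, which is the comparison that matters here.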

ARM was a big win for a few reasons:
- 32 bit wide databus
- a pipeline that's long enough to be useful but not so long it's costly and complicated
- simple regular architecture, easy to design and get right, easy to learn and use effectively
- use of DRAM page mode to improve memory bandwidth by (I think) about 30% in practice
- not constrained by backward compatibility

Not that ARM couldn't be improved on, but that's what was built. And it did come out as a big win over 68000 (especially 68008!) and NS32k. ARM has enough regularity and enough registers to be a good target for compiled code, which is another kind of win.

(Acorn, and particularly Sophie, seem to have been quite fixed in their thinking about using interrupts in a time-critical way, where other CPUs would use DMA. Having made that choice, the very long worst-case interrupt timings of the complex CPUs were a problem for them.)

The '832, on the other hand, has
- backward compatibility as a constraint
- an 8 bit bus
- no extra registers (or cache, or anything else) to relieve memory bandwidth shortage
- even fewer engineering resources and cashflow to make a go of it

For that last point: Acorn's business went through ups and downs, but was always putting resources into projects, had cashflow from the educational computer business, and could draw talent from the world-class university in the same city.

tokafondo · Post by **tokafondo** » Wed Jul 29, 2020 12:34 pm

Hi. Interesting conversation. I'm only stepping in to ask whether it would be possible to create a 32-bit 65xx-compatible processor with no WDC technology at all, much as Compaq did back in the day when it created an IBM PC compatible BIOS.

BigEd · Post by **BigEd** » Wed Jul 29, 2020 12:56 pm

There's no problem at all with intellectual property, other than trademark (branding), as discussed recently.
viewtopic.php?f=1&t=6217

Be sure to check the various threads linked in the index post (linked above by Garth)
viewtopic.php?f=1&t=4216

Edit: I am of course quite fond of the 65Org32 idea - it's inefficient but it's straightforward to define and build. And the OPC machines have been great fun and are also relevant here:

Quote:

Our OPC7 machine might be of interest, as being very much reduced complexity, and yet having something of a 6502 feel to it. Indeed, we did some transliteration of code from 6502 to the OPC machines with success. By the time we got to OPC7, we had a 32 bit machine with all instructions being one word long, and in one of just two formats. This series of machines has proved to be quite fun to program: small instruction set, sufficiently powerful, and a spacious register file. One of the motivations was the challenge to make something of about the complexity of the 6502 - the sort of thing which could possibly have been a contender in the day.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jul 29, 2020 2:26 pm

dmsc wrote:

For example, look at an instruction like "LDA (12), X".

There isn't any such instruction in the 6502.

BigEd · Post by **BigEd** » Wed Jul 29, 2020 3:53 pm

BigDumbDinosaur wrote:

dmsc wrote:

For example, look at an instruction like "LDA (12), X".

There isn't any such instruction in the 6502.

Fortunately, it's completely obvious what is meant, and the argument applies just the same to LDA (12),Y.

Possibly of interest in relation to building wider machines but without wide memory interfaces, @hoglet found that a modest 64-line instruction cache made a major improvement to our 16 bit CPU when operated with an 8 bit wide memory.
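To see why even a small instruction cache helps a 16-bit CPU on an 8-bit bus, here is a minimal Python sketch. It is not the design mentioned above: the direct-mapped organisation, one 16-bit instruction word per line, and the cost of two byte-wide bus accesses per miss are all assumptions for illustration, and the names are hypothetical.

```python
class DirectMappedICache:
    """Direct-mapped instruction cache, one instruction word per line."""

    def __init__(self, lines=64):
        self.lines = lines
        self.tags = [None] * lines   # full word address stored as the tag
        self.hits = 0
        self.misses = 0

    def fetch(self, word_addr):
        """Return the number of byte-wide bus accesses this fetch costs."""
        index = word_addr % self.lines
        if self.tags[index] == word_addr:
            self.hits += 1
            return 0
        self.tags[index] = word_addr
        self.misses += 1
        return 2   # assumed: a 16-bit word needs two 8-bit bus accesses


def bus_accesses(trace, cache=None):
    """Total bus accesses for a sequence of instruction word addresses."""
    if cache is None:
        return 2 * len(trace)
    return sum(cache.fetch(addr) for addr in trace)


if __name__ == "__main__":
    loop = list(range(100, 116)) * 10   # a 16-instruction loop run 10 times
    cache = DirectMappedICache()
    print("uncached:", bus_accesses(loop))
    print("cached  :", bus_accesses(loop, cache), "hits:", cache.hits)
```

For a loop that fits in the cache, only the first pass touches the bus for instruction fetches, leaving the narrow memory interface free for data.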

On the topic of cache, for any implementation that isn't an FPGA, one can consider how to make use of a given transistor budget. For example, using the 4x transistor count of a Z80 and applying it to a 6502 might provide a useful amount of cache. The microcode in a 68000 can be considered as a choice: the same area budget in an ARM could be used for cache. Indeed, one can view the cached instructions on a RISC as being comparable to an automatically-tuned program-appropriate form of microcode. (FPGAs are an exception, because they are pretty large compared to our simple CPU designs, and until they are full, the size of a design doesn't change the cost at all. But in all cases, a large design is likely to support only a slower clock, and have a corresponding performance hit - unless it happens that some other part of the system limits the clock.)

To look again at the headline question: "Would a 32 bit evolution of 65816 stay accumulator based"... I think the answer is yes. To substitute a register-based instruction set would be to change the machine so much that it wouldn't really be an evolution of the '816. Merely adding a second accumulator would barely change the machine from being accumulator-based. It might be interesting to make a machine with four registers, corresponding roughly to the facilities of the '816's A, X, Y and SP, but usually a register-based machine would choose at least eight. The question would be a tradeoff between the performance gain of a slightly denser instruction coding and the cost of more memory traffic juggling values.

barrym95838 · Post by **barrym95838** » Wed Jul 29, 2020 6:23 pm

The ability to ADX #10 or SBS #25 or EOY #$55 would IMO be a welcome addition to the 65xx series, and I do that in my still incomplete 65m32a processor. To me, the 65xx "feel" is in the three-letter mnemonics and the single operand structure of the assembly language, not necessarily in the number of accumulator-capable registers. My largest divergence is in the elimination of indirect addressing, which is a big one, but one I can live with, due to the enhanced capabilities of the direct and immediate modes.

GARTHWILSON · Post by **GARTHWILSON** » Wed Jul 29, 2020 7:38 pm

barrym95838 wrote:

and I do that in my still incomplete 65m32a processor. [...] My largest divergence is [...]

For newcomers: You can read about Barry's 65m32 design at http://anycpu.org/forum/viewtopic.php?f=23&t=300 .
