RISC on 8 bit mostly means: One cycle per instruction

C16with64K · Post by **C16with64K** » Tue Jan 23, 2018 11:39 pm

I looked at that RCA CPU also, and on both CPUs I cannot see much bloat that could be reduced. Okay, the 6502 opcodes seem to have been laid out randomly. When I first read about RISC it was about cache and being able to specify 3 regs in one opcode. And, like on a 486 years later, every instruction takes one cycle. On an 8-bit CPU (without much space for cache on die), a Load or Store instruction (immediate ones also) blocks the instruction decoder (von Neumann architecture). In order not to stall the CPU, two register instructions have to be fitted into one opcode. For example:
(I hope you like my LambdaExpression C# like syntax)
Base+Index -> Address
Index++ , [Address] -> DataRegister

or

++Counter
Counter->[ StackPointer] , StackPointer++ -- inspired by ARM Cortex M1

I also need this

DataRegister->[ StackPointer] , StackPointer++
newStuff+SomeMore -> DataRegister -- operation from same opcode. In the meantime the instruction prefetch of the 6502 is refilled

I count 4 register references. To avoid immediates, a baseRegister (BX) is needed. Basically we have 4 register references, 2 bits each, and are out of bits. We have not even specified the two operations, which also need 3 bits each.
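A minimal sketch of the bit count above, assuming exactly the widths given in the post (2 bits per register reference, 3 bits per operation). The function and parameter names are made up for illustration.

```python
# Bit budget for the two-operation opcode described in the post.
# Widths are taken from the text; names are illustrative only.
OPCODE_BITS = 8


def bits_needed(register_refs=4, bits_per_reg=2, operations=2, bits_per_op=3):
    """Return the number of bits needed to encode all fields."""
    return register_refs * bits_per_reg + operations * bits_per_op


needed = bits_needed()
print("bits needed:", needed)
print("bits over an 8-bit opcode:", needed - OPCODE_BITS)
```

With these widths the register fields alone already fill the 8-bit opcode, and the two operation fields have nowhere to go.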

Opcodes without LoadStore have no second operation. Instead they have more registers to choose from.

Our code may have to interact with methods in another object with code at a CodeBaseAddress. For this the BackgroundProgramCounter can be loaded with CodeBaseAddress+Immediate. CodeBaseAddress is a middle ground between generalPurpose (there it functions as SourceAddress) and programCounter.
I do not want to mix general purpose registers with the program counter completely like in the RCA CPU because there is not enough code space.

Also relative jumps should avoid an immediate value:
BitPattern: 1=Jump, Condition=[0=always, 1=Z, 2=C, 3=N], Direction, Length-1 -- a jump to self is nonsense, a jump to next also, a jump to previous is almost useless.
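One possible reading of this bit pattern, as a hedged sketch. The post does not give field widths or bit order, so the layout here is an assumption: bit 7 is the jump flag, bits 6-5 the condition, bit 4 the direction, and bits 3-0 hold the distance minus one. Which instruction counts as distance 1 is left open, as in the post.

```python
# Assumed layout (not specified in the post):
#   bit 7    : 1 = jump
#   bits 6-5 : condition (0=always, 1=Z, 2=C, 3=N)
#   bit 4    : direction (1 = backward)
#   bits 3-0 : distance - 1
CONDITIONS = ("always", "Z", "C", "N")


def encode_jump(condition, backward, distance):
    """Pack a short relative jump into one byte."""
    if not 1 <= distance <= 16:
        raise ValueError("distance must fit in 4 bits after subtracting 1")
    return (0x80
            | (CONDITIONS.index(condition) << 5)
            | (int(backward) << 4)
            | (distance - 1))


def decode_jump(opcode):
    """Unpack a jump byte into (condition, backward, distance)."""
    if not opcode & 0x80:
        raise ValueError("not a jump opcode")
    condition = CONDITIONS[(opcode >> 5) & 0b11]
    backward = bool(opcode & 0x10)
    distance = (opcode & 0x0F) + 1
    return condition, backward, distance


byte = encode_jump("Z", backward=True, distance=3)
print(hex(byte), decode_jump(byte))
```

Under this layout the jump flag consumes half of the opcode space, which leaves 128 codes for everything else.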

Jumps to subroutines need a ProgramCounterBackupReg. ProgramCounter and Reg are just swapped. One or two immediate Load instructions load the target address. After the jump the CPU pushes the return address onto the stack when the bus is free. For this some opcodes have 2 operations but no Load/Store. If a ProgramCounter swap appears before the push is finished: Wait State(s)

In conclusion, an 8-bit CPU needs to have a lot of decoding logic which decodes (more or less arbitrary) 8 bit opcodes to more useable 16 bit microcode. With this in mind I feel that the 6502 is suboptimal and may have less fun writing code for it.

GARTHWILSON · Post by **GARTHWILSON** » Wed Jan 24, 2018 1:31 am

This is interesting...but what is your goal?

commodorejohn · Post by **commodorejohn** » Wed Jan 24, 2018 1:50 am

C16with64K wrote:

With this in mind I feel that the 6502 is suboptimal and may have less fun writing code for it.

Then, um, what are you doing here?

GARTHWILSON · Post by **GARTHWILSON** » Wed Jan 24, 2018 2:30 am

It sounds like you might be oriented toward a compiler-friendly architecture. I understand the 68000's designers consulted with compiler writers to try to produce a processor that was more compiler-friendly. They were apparently quite successful at that, although all the registers and the wider bus did not result in better performance. The 65816 outperformed the 68000 in the Sieve benchmark even though the 68000 had more and wider registers and a wider data bus, and the 6502 and 65816 completely blew away the 68000 in interrupt performance. Back in the day of early home computers, a 1MHz 6502 also consistently outperformed the 4MHz Z80 in BASIC benchmarks even though the Z80 had the higher clock rate, and more and wider registers.

Is it even possible to do one instruction per cycle on an 8-bit machine? The PIC16, which I have used in many products I've brought to market, has a Harvard architecture with an 8-bit data bus but a 14-bit instruction bus, and still takes 4 clocks per instruction (an instruction cycle is four clock cycles), 8 clocks for a few of the instructions, and because of its instruction set being poor compared to the 65c02's (not to mention the 65816's), the PIC16 does not perform as well as the 65's do. And if you need more variable space than you can get in one of its small banks, fooling with the bank bits further reduces efficiency. Add to that the overhead incurred if you need more program space than what fits in one page. It's really quite a mess. If you have to grab the op code and operand all at once with a narrow bus, the banking and paging problems seem unavoidable to me for all but the smallest applications.

There have been several discussions and a few efforts toward extending the 6502/816 to 32 bits. One is the 65Org32, which extends all buses and registers (except the status register) to 32 bits, and the bank and direct-page registers become merely offsets. This is not a one-cycle-per-instruction thing though. The 32-bit effort toward a somewhat 65-flavored processor that I would give the best chance of becoming reality at this point is Mike Barry's 65m32. It does merge operands and op codes into a single 32-bit word, all fetched at once on the data/instruction bus.

Dr Jefyll · Post by **Dr Jefyll** » Wed Jan 24, 2018 3:17 am

C16with64K wrote:

With this in mind I feel that the 6502 is suboptimal and may have less fun writing code for it.

You may be right. Your own experience is something you'll have to judge for yourself. But I can safely say there are others here who do find the 6502 (and 65816) fun.

C16with64K wrote:

(more or less arbitrary) 8 bit opcodes

Can't argue with you there -- the opcode map is quite perplexing. Make no mistake; there are some very clear patterns there. But there also are parts where it seems as if the designers did their fine-tuning with a chainsaw.

For example: dunno if you noticed, but there are entire columns on the original 6502 opcode map which are unused. According to one line of reasoning (BigEd could provide the references for this) the 6502 was originally intended to have two accumulators. But they found this would've resulted in too large (and hence too expensive) a chip. So a brutally ruthless decision was made. The chip got very rudely downsized quite late in its development -- leaving some gaping holes in the opcode map. It reminds me of the 19th-century steamboat races where the desperate contestants sometimes resorted to burning the furniture as fuel! Not exactly graceful or systematic, but sometimes that was what it'd take to win. And 6502 did win! (dramatically undercutting Motorola and Intel and setting a new price point that resulted in the birth of personal computing)

-- Jeff

BigEd · Post by **BigEd** » Wed Jan 24, 2018 5:39 am

Welcome, C16with64K... I think possibly you intended to reply to an existing thread, perhaps
viewtopic.php?f=1&t=1419
(Edit: and see my link to many other related discussions, on page 13.)

I think what you're saying is that an 8-bit opcode has barely any room to specify operation and operands, so packing a two or three operand instruction, RISC-style, into 8 bits, is a challenge. I'm sure you'd be right about that! As an 8 bit address is a very severe limitation, you either need multi-byte instructions or to marshal addresses in registers. In a strict single byte instruction layout you can't even fit 8 bits of constant into an instruction, so that marshalling is going to be quite fiddly.

But probably it's better to talk about particular machines - actual ones, or paper designs - than to get tied up in terminology, as to whether something qualifies as RISC, or as 8 bit.

C16with64K · Post by **C16with64K** » Wed Jan 24, 2018 6:38 am

gotta go to work soon, but:

@BigEd
I think no thread was much more specific to my topic than any other. I think I googled about this for some time. Will have a look at it.

@GARTHWILSON
Goal: mute all this discussion: "I want more regs". No you can't. Also I wanted to understand modern CPUs. For me their branch prediction and stuff always looked like a zero-sum game. Also since Blender doesn't recreate me as hoped, I was thinking about doing a race game for C16 (hence my LogonName).

Some Errata:
In 8bit land addressing is of course:
[BH.X] -- paging

or

BL+X -> AddrL
Carry+BH -> AddrH, [YH:YL]-> DL -- hiding 16bit arithmetic behind memory access

Also I wanted to make a concise note why this SPARC register window craze went nowhere, but the stack survived the test of time.

Instead of CISC operations no one understands, the use of a VeryLargeInstructionWord allows one to program using the instructions. That way one would program in C, but may write jumps/branches inline and preload jump addresses manually to give the compiler a hint. The compiler then tries to put together the puzzle pieces of the opcode set. This involves register (re-)naming, instruction reordering (in case of , not so much). Since this is NP-complete, a source revision control system would always be used and only changed code is recompiled. Calling convention is: by register or inlining (for very small code and/or where regs can be assigned flexibly). All compiled functions have a list of destroyed regs. For a fresh code download, compiling even small programs takes one night. You could also let the compiler run as long as you want to find better combinations, or import data from a trace to hint at the 10% of code used 90% of the time. But beware the worst case: Quake was optimized for the worst case, not for the normal case.

And for RISC: I do not understand how you could read and write on the same bus in one cycle. We do not have enough die area for two busses. At most there could be two sets of registers, each with their own bus. For example, the accumulator latches the output from the ALU. Everything else is on the bus. But then how could AX be used as source and target in every 8-bit CPU? There need to be two and a bridge to toggle between them. Two busses do not seem that much more complicated. This reduces code space.

White Flame · Post by **White Flame** » Wed Jan 24, 2018 6:55 am

Instructions don't "take 1 cycle" on the original RISC design, nor in modern systems. They're a pipeline in which the pipe can take in (at least) 1 instruction per cycle in a continuous stream, but each instruction takes multiple cycles from hit-the-decoder to actually committing its effects as it works its way through. The 6502 has some aspects of overlap between instructions, but isn't generally considered properly pipelined.
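The difference between latency and throughput described here can be sketched with the usual idealized pipeline formula. This assumes no stalls, flushes or cache misses, and the 5-stage depth is only a textbook example value, not a claim about any particular chip.

```python
def pipeline_cycles(instructions, stages):
    """Total cycles for an ideal pipeline that accepts one instruction per cycle.

    The first instruction needs `stages` cycles to commit; every
    following instruction commits one cycle after the previous one.
    """
    if instructions == 0:
        return 0
    return stages + instructions - 1


STAGES = 5  # assumed depth, for illustration only

one = pipeline_cycles(1, STAGES)
many = pipeline_cycles(1000, STAGES)
print("latency of a single instruction:", one, "cycles")
print("cycles for 1000 instructions:", many)
print("average cycles per instruction:", many / 1000)
```

Each instruction still spends several cycles in flight; only the average over a long stream approaches one cycle per instruction.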

The biggest aspect of how many cycles a 6502 instruction takes is the memory bus. It takes 1 cycle per byte of instruction opcode/operands, plus 1 cycle per read or write for addressed memory accesses, averaging somewhere over 3 cycles per instruction in common use, which is fewer cycles than its contemporaries. The "dead" cycles where carry needs to propagate across a 16-bit value, or where 1-cycle instructions can't overlap, are a known deficiency in efficiency (and are optimized away in the 65c02 etc), but it doesn't make a lot of sense to say that's a failing of the ISA. It's a transistor/architecture level speed constraint.
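The rule of thumb above (one cycle per instruction byte, one per data access, plus the occasional dead cycle) can be written down as a rough estimator. This is a simplified sketch, not a cycle-exact model; the per-instruction breakdowns below are my own reading of how the well-known NMOS 6502 cycle counts decompose.

```python
def estimated_cycles(instruction_bytes, memory_accesses=0, dead_cycles=0):
    """Rough 6502 cycle estimate: bus traffic plus internal dead cycles."""
    return instruction_bytes + memory_accesses + dead_cycles


examples = {
    # name: (opcode+operand bytes, data/pointer accesses, dead cycles)
    "LDA #imm":    (2, 0, 0),
    "LDA abs":     (3, 1, 0),
    "INY":         (1, 0, 1),  # second cycle does no useful bus work
    "STA (zp),Y":  (2, 3, 1),  # two pointer reads, one write, one fix-up cycle
}

for name, parts in examples.items():
    print(f"{name:12s} {estimated_cycles(*parts)} cycles")
```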

If you are working with 16-bit address operands, which is common for general purpose 8-bitters, then you'll have 2 cycles spent on an 8 bit bus per operand no matter what. The only real way to avoid that is to have a number of explicit 16-bit address registers on-chip, which wasn't in the budget for the 6502, and would only help if they were reused often because they'd still have to be loaded with 16-bit literal operands at some point. But even if you did have them, then the address modes & ISA would become more complex with those additional registers, not simpler to decode, and opcodes would very likely grow, losing the advantage of dropping 16-bit operands, and not getting any closer to 1 cycle per instruction.

[edit: as you're now talking VLIW, I doubt you're talking about an 8-bit data bus anymore? That's really a completely different class of chip than the classic 8 bits, and would not have anywhere near the code density required to fit complex programs in 4k-64k computers with ~1MB/sec bus bandwidth.]

ttlworks · Post by **ttlworks** » Wed Jan 24, 2018 7:42 am

C16with64K wrote:

With this in mind I feel that the 6502 is suboptimal and may have less fun writing code for it.

_Another_ boring RISC versus CISC debate ?

6502 was introduced in 1975.

When taking a look at advertisements for RAM chips from that time frame, and converting the prices to today's currency,
1kB of RAM probably cost you more than 50€.
So back then, using a CISC instruction set instead of RISC in order to get by with as little RAM as possible made a lot of sense.

Also, the 6502 initially wasn't intended to be used in computers, but in "embedded" applications (cash registers etc.) instead.
But because the 6502 had the best price/performance ratio in that time frame, it became interesting for hobbyists out to build their own computers.

So for a RISC versus CISC comparison, you better compare the 6502 to the PIC16 family of microcontrollers.
Writing machine code for the 6502 _certainly_ is a lot more fun than writing machine code for PIC16...

...Why yes: there were RISC CPUs with two bus systems.
For instance: TMS320C30 DSP from TI. (Datasheet page 12.)

Druzyek · Post by **Druzyek** » Wed Jan 24, 2018 7:46 am

Quote:

Is it even possible to do one instruction per cycle on an 8-bit machine?

Some "modern" (as in produced in the last 10-15 years) 8051 variants like the DS89C450 claim to be single cycle. The original 8051 architecture took 12, 24, or 48 cycles depending on the instruction, and the new ones take 1-4 to do the same work, which is about 10x faster on average.

I tried comparing BCD add functions in the Kowalski simulator and an 8051 simulator and came to the conclusion that a single-cycle 8051 would take fewer cycles than a 6502 (results here). After poking around the datasheet I saw that the DS89C450 would need extra cycles to access external memory in single-cycle mode, so I'll have to amend my conclusion when I try the test on a real chip. In any case you can run that chip at at least 33MHz, so I would expect 2-3 times the performance of a 6502 at 14MHz. I don't like doing all my indexing and indirection by hand, though. 6502 assembly is still a lot more fun.

GARTHWILSON · Post by **GARTHWILSON** » Wed Jan 24, 2018 7:59 am

Druzyek wrote:

In any case you can run that chip at at least 33MHz, so I would expect 2-3 times the performance of a 6502 at 14MHz.

How's the instruction set though, including addressing modes? I ask because as I posted further up, some other processors take a lot more cycles to do a job than the 65C02 does, even ones that would initially appear to do better—meaning you can't just go on clock speed.

Druzyek · Post by **Druzyek** » Wed Jan 24, 2018 10:08 am

GARTHWILSON wrote:

Druzyek wrote:

In any case you can run that chip at at least 33MHz, so I would expect 2-3 times the performance of a 6502 at 14MHz.

How's the instruction set though, including addressing modes? I ask because as I posted further up, some other processors take a lot more cycles to do a job than the 65C02 does, even ones that would initially appear to do better—meaning you can't just go on clock speed.

TL;DR-For my test on simulators 6502 takes 708 cycles and 8051 takes 493, which is probably 700-800 on a real chip. 8051 hand coded addressing modes are a pain but possibly faster than 6502.

The way I went about it was to write the BCD add function for both chips (since that's what I want to use them for) and use the simulators to count cycles. The 6502 took 708 and the 8051 took 493. For a standard 8051 you can just multiply 493 by 12 to get the number of machine cycles but the chip I'm using is "single-cycle" meaning one machine cycle, not 12, per instruction cycle. Not all of the 1 cycle instructions take only one machine cycle on the new chip, which is why you get "up to" a 10x instead of 12x speed up on the new version. ((493*12)/10 ≈ 592). The movx instruction (which normally takes 2 instruction cycles = 24 machine cycles) for accessing external memory also needs extra cycles since it has to set an external buffer driving the high byte of external memory before read/write, which you can't do in 2 cycles (although you can set it to skip writing the buffer if the value hasn't changed.) You also get a penalty when reading/writing external memory on one cycle then externally fetching an instruction on the next. If you run from the internal 64kb flash you never get that penalty. Those factors make it hard to say for sure how many cycles it will take on the real chip but I imagine another 10-20% over what the simulator shows (guessing 700-800) compared to 708 on 6502.
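To put the cycle counts next to the clock rates mentioned earlier in the thread (14MHz for the 6502, 33MHz for the DS89C450), here is a small sketch that converts cycles to microseconds. It assumes the guessed 700-800 figure for the real 8051 part and that each counted cycle is one oscillator clock on both chips.

```python
def run_time_us(cycles, clock_mhz):
    """Execution time in microseconds for a given cycle count and clock."""
    return cycles / clock_mhz


t_6502 = run_time_us(708, 14)          # simulator count from the post
t_8051_best = run_time_us(700, 33)     # lower end of the guessed range
t_8051_worst = run_time_us(800, 33)    # upper end of the guessed range

speedup_low = t_6502 / t_8051_worst
speedup_high = t_6502 / t_8051_best

print(f"6502 at 14MHz: {t_6502:.1f} us")
print(f"8051 at 33MHz: {t_8051_best:.1f} to {t_8051_worst:.1f} us")
print(f"speedup: {speedup_low:.2f}x to {speedup_high:.2f}x")
```

Under these assumptions the ratio lands a little above 2x, at the low end of the 2-3 times estimate.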

The instruction set is totally oriented toward microcontrollers and the first 256 bytes of RAM, which is always on-chip. All of the calculation operations (add, shift, xor, etc) work on that address range. Other than direct the only addressing mode is indirection through one of two 8-bit registers limited to the first 256 bytes. To address the rest of the 64k of RAM there are two 8 bit registers that combine to form a 16 bit pointer, DPTR, but all you can do with it is load and store with the accumulator. If you want indirection or indexing you have to calculate it all by hand and copy it to those two registers. The good news is that modern chips offer a second pointer so you can copy between two locations without constantly reloading the pointer. You can also turn on auto-switch and auto-increment.

I worked on some macros to emulate addressing modes on the 8051. For example, on the 6502:

Code:

LDA Data
LDY #Offset
STA (Address), Y

LDY takes 2 cycles and the STA takes 6 for a total of 8. Here is what I used on the 8051:

Code:

IndexDPTR0 MACRO DPTR_copy, Index
  clr C                    ;Clear carry
  mov A, DPTR_copy         ;Load low byte of saved pointer
  add A, Index             ;Add index, carry out goes to C
  mov DP0L, A              ;Store in low byte of external mem pointer
  mov A, LOW(DPTR_copy)+1  ;Load high byte of saved pointer
  addc A, #0               ;Add the carry into the high byte
  mov DP0H, A              ;Store in high byte of external mem pointer
ENDM

IndexDPTR0 Address, #Offset ;Set up external memory pointer
mov A, Data
movx @DPTR, A ;Save to address in external mem pointer
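What the macro above computes is a 16-bit pointer plus an 8-bit index, done one byte at a time with the carry passed from the low byte to the high byte. A minimal sketch of the same arithmetic, with a made-up function name:

```python
def index_pointer(pointer, index):
    """Add an 8-bit index to a 16-bit pointer using only 8-bit adds."""
    if not 0 <= pointer <= 0xFFFF or not 0 <= index <= 0xFF:
        raise ValueError("pointer must be 16-bit and index 8-bit")
    lo = pointer & 0xFF
    hi = pointer >> 8

    total = lo + index           # add A, Index
    new_lo = total & 0xFF        # result kept in the 8-bit accumulator
    carry = total >> 8           # carry flag after the add

    new_hi = (hi + carry) & 0xFF  # addc A, #0
    return (new_hi << 8) | new_lo


print(hex(index_pointer(0x12F0, 0x20)))
```

The 6502 does the same carry propagation in hardware as part of the (Address),Y addressing mode.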

Each line of the macro takes one cycle and the movx is 4, which is 11 compared to 8 on the 6502. The obvious downfall (other than being slower) of this is that it takes up way more program space on the 8051 if you don't turn it into a sub-routine. The advantage is that it saves cycles in incremental loops since the 6502 wastes cycles reloading the target address in STA (Address),Y every time through the loop. Imagine this loop body on the 6502:

Code:

LDA (Source), Y ;5-6 cycles
STA (Dest), Y ;6 cycles
INY ;2 cycles

This totals 13-14 cycles. On the 8051 with Source and Dest in the two data pointers:

Code:

;Switch to first data pointer
anl AUXR1, #0FEh ;1 cycle
movx A, @DPTR ;4 cycles
inc DPTR ;2 cycles
;switch to second data pointer
orl AUXR1, #1 ;1 cycle
movx @DPTR, A ;4 cycles
inc DPTR ;2 cycles

This adds up to 14 cycles but with auto-increment and auto-switch it is shortened to 8 cycles with only the two movx instructions.
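The totals quoted for the two loop bodies can be tallied as follows. The per-instruction cycle counts are the ones given in the listings above; loop overhead such as the branch is left out, as it is in the post.

```python
# (min, max) cycles per instruction, copied from the listings above
loop_6502 = {
    "LDA (Source),Y": (5, 6),
    "STA (Dest),Y":   (6, 6),
    "INY":            (2, 2),
}

loop_8051_manual = {
    "anl AUXR1, #0FEh": (1, 1),
    "movx A, @DPTR":    (4, 4),
    "inc DPTR (1st)":   (2, 2),
    "orl AUXR1, #1":    (1, 1),
    "movx @DPTR, A":    (4, 4),
    "inc DPTR (2nd)":   (2, 2),
}

# With auto-switch and auto-increment only the two movx remain
loop_8051_auto = {
    "movx A, @DPTR": (4, 4),
    "movx @DPTR, A": (4, 4),
}


def total(loop):
    """Return (best case, worst case) cycles for one pass of the loop body."""
    return (sum(lo for lo, _ in loop.values()),
            sum(hi for _, hi in loop.values()))


for name, loop in [("6502", loop_6502),
                   ("8051 manual", loop_8051_manual),
                   ("8051 auto", loop_8051_auto)]:
    print(name, total(loop))
```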

It is a fun chip to play with since RAM and ROM are separate and you have 8 GPIOs, so you don't have to worry too much about memory mapping, and it is a lot faster than a 6502 as far as I can tell. I still like programming on the 6502 a lot better though.

whartung · Post by **whartung** » Wed Jan 24, 2018 5:30 pm

GARTHWILSON wrote:

It sounds like you might be oriented toward a compiler-friendly architecture. I understand the 68000's designers consulted with compiler writers to try to produce a processor that was more compiler-friendly. They were apparently quite successful at that, although all the registers and the wider bus did not result in better performance. The 65816 outperformed the 68000 in the Sieve benchmark even though the 68000 had more and wider registers and a wider data bus, and the 6502 and 65816 completely blew away the 68000 in interrupt performance. Back in the day of early home computers, a 1MHz 6502 also consistently outperformed the 4MHz Z80 in BASIC benchmarks even though the Z80 had the higher clock rate, and more and wider registers.

How well does the 65816 compare against the 68000 when running software compiled with high level languages? A Sieve is a micro benchmark, the 6502 does well in straight line code, but struggles with "generic" function calls because of the argument passing overhead. The '816 should suffer less from this, however. One can comment about the efficiency of the compilers, but that's kind of the point of the 68K -- to be able to write good compilers.

If the compiled code from high level languages shows a net win for the end user, then arguably the 68K design is more successful for a wider range of use cases.

The interrupt performance makes sense simply because there is less state (fewer registers) to save.

Isn't there a chip with a "fast interrupt" that essentially only saves the PC, and leaves it to the interrupt routine to properly save any other registers it uses?

Druzyek · Post by **Druzyek** » Wed Jan 24, 2018 7:15 pm

Quote:

Isn't there a chip with a "fast interrupt" that essentially only saves the PC, and leaves it to the interrupt routine to properly save any other registers it uses?

The 6809 does: http://www.roust-it.dk/coco/6809irq.pdf

rwiker · Post by **rwiker** » Wed Jan 24, 2018 8:30 pm

Druzyek wrote:

Quote:

Isn't there a chip with a "fast interrupt" that essentially only saves the PC, and leaves it to the interrupt routine to properly save any other registers it uses?

The 6809 does: http://www.roust-it.dk/coco/6809irq.pdf

I think the ARM does, too (or maybe that was just the early versions).

The ND 1/10/100 series of minicomputers had 16 interrupt levels, and a dedicated set of registers for each level.
