Improving the 6502, some ideas

GARTHWILSON · Post by **GARTHWILSON** » Fri Jul 17, 2009 4:05 am

bitwise, although VBR started the topic, he has said little since then and hasn't complained about any of my ideas, so let me present the "Reader's Digest" version of my proposal, because it is very much a 6502 (with the extra 65816 capabilities), just completely 32-bit. Basically a byte just becomes 32 bits instead of 8.

Not a RISC. Same von Neumann architecture as 6502. Op code and operand are not combined. Most instructions remain the same, unlike the 65GZ032.
Has 6502's A, X, Y, S, P, and PC registers, and 65816's DP, DB, and PB registers-- but they're all 32-bit (although only about 8 status register bits would get used.)
Simpler, because everything is in ZP (or, more accurately, DP), because ZP has over 4 billion addresses. No operand requires more than one fetch. Even the 65816's bank boundaries are gone.
Since the data bus is 32-bit, there will not be separate 8-, 16-, and 32-bit modes like the 65832 had.

The what-abouts are addressed in earlier posts, for example that 32-bit-only is not a problem for 8-bit I/O and ASCII data.

It would not have an emulation mode to run old 6502 code directly, but your programming and construction knowledge does transfer directly. There's almost nothing new to learn.

Trying to emulate one with a microcontroller with nearly 80 I/O pins (10 8-bit ports) would be extremely slow, like having phase 2 be a few tens of kHz; so that's out of the question.

I'll try to post some code examples later.

OwenS · Post by **OwenS** » Fri Jul 17, 2009 10:05 am

I still don't understand why you want to go for a word addressed architecture over a byte addressed one. With a byte addressed architecture, you just need one opcode for each sized load (Or flags somewhere indicating the load size), and a bit of logic somewhere to zero/sign extend values.

A major problem I can see for you though is opcode alignment. If instructions are 8-bits, and followed by a 32-bit literal, then most of the time the literal is going to be unaligned and you're going to be spinning for at least a couple of cycles loading it.
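To put a number on that alignment cost, here is a minimal Python sketch. It assumes a byte-addressed machine whose 32-bit bus can only read naturally aligned 4-byte words; the function name `bus_reads` is made up for illustration.

```python
def bus_reads(addr: int, size: int, bus_bytes: int = 4) -> int:
    """Count the aligned bus reads needed to fetch `size` bytes at byte address `addr`."""
    first_word = addr // bus_bytes               # bus word holding the first byte
    last_word = (addr + size - 1) // bus_bytes   # bus word holding the last byte
    return last_word - first_word + 1

# An 8-bit opcode at address 0 puts its 32-bit literal at address 1,
# so the literal straddles two bus words.
print(bus_reads(1, 4))
# The same literal starting on a 4-byte boundary needs a single read.
print(bus_reads(4, 4))
```

Any 32-bit literal whose address is not a multiple of 4 costs a second read under these assumptions.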

If you want to avoid that kind of delay, you need a prefetch buffer or such; of course, now we're heading into a pipelined architecture. Admittedly, pipelining isn't that much more complex than not pipelining. The main problem is interlocks to ensure an instruction doesn't enter the pipeline before one it's dependent upon finishes, and duplication of functionality in different stages (The second is more an issue with CISC style architectures - the RISC one I'm designing doesn't really have this issue).

BigEd · Post by **BigEd** » Fri Jul 17, 2009 10:41 am

I read a little about coldfire the other day: it seems it was a simplification of the 68k, which allowed for smaller faster implementation. Worth a look. They chose to go for variable-length instructions, for the sake of code density.

But I think the 32-bit byte idea here will have every fetch, and therefore the opcodes, be 32 bits. The number of memory accesses will be very like the 6502's, with the extra width being useful for a proportion of the time.

In the interest of simplicity, and similarity with 6502, memory and pincount are being thrown in.

(I suspect there would need to be a sign-extend for the case where a fetch accesses some 8-bit-wide data which is to be handled as signed.)

Of course a 32-bit opcode does allow for embedded operands and easy decode: increment can be extended into an add short literal for example. It would be possible to add 16-bit relative branching, maybe even 24-bit relative, but I think the idea would be not to do that: a 32-bit branch opcode followed by a 32-bit offset.

GARTHWILSON · Post by **GARTHWILSON** » Fri Jul 17, 2009 7:06 pm

Ed's got it. There are no alignment problems. On the 6502, ADC# takes two bytes and two clocks for any operand up to $FF. On the 65Org32, it takes two bytes (although they'll be 32-bit ones) and two clocks for any operand up to $FFFFFFFF.
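A rough Python model of that two-fetch ADC#, treating memory as a list of 32-bit "bytes". This is only a sketch: reusing the 6502's $69 op code value is an assumption, decimal mode and the other flags are ignored, and `step_adc_imm` is a made-up name.

```python
MASK32 = 0xFFFFFFFF
ADC_IMM = 0x69  # 6502 op code for ADC #; keeping this value on the 65Org32 is an assumption

def step_adc_imm(mem, pc, a, carry):
    """Execute one ADC # on a memory of 32-bit 'bytes' (binary mode only).

    Returns (new_a, new_carry, new_pc, fetches).
    """
    fetches = 0
    opcode = mem[pc]          # fetch 1: the op code, one 32-bit byte
    fetches += 1
    if opcode != ADC_IMM:
        raise ValueError("this sketch only models ADC #")
    operand = mem[pc + 1]     # fetch 2: the whole 32-bit operand in one go
    fetches += 1
    total = a + operand + carry
    return total & MASK32, total >> 32, pc + 2, fetches

# Adding $FFFFFFFF to 1 wraps the accumulator to 0 and sets carry, in two fetches.
print(step_adc_imm([ADC_IMM, 0xFFFFFFFF], 0, 1, 0))
```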

It's true that many of the bits in the op code are not strictly needed, but the 6502 simplicity is kept, and I suspect that the wider op code field could simplify the instruction-decoding logic. It does however allow for some BBS/BBR/SMB/RMB-type instructions like the Rockwell and WDC 65c02's have where one of the operands is integrated with the op code, for limited use like an op code to shift left 22 bits with the barrel shifter instead of having to do ASL 22 times (or whatever number you need).
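As an illustration of an operand integrated with the op code, here is a small Python sketch. The bit layout (op code in the low 8 bits, embedded operand in the upper 24) and the op code value `ASL_N` are assumptions made up for the example, not a settled encoding.

```python
MASK32 = 0xFFFFFFFF
ASL_N = 0x0A  # hypothetical op code for "shift left by the embedded count"

def decode(word):
    """Split a 32-bit instruction word into (op code, embedded operand)."""
    opcode = word & 0xFF               # assumed: low 8 bits select the instruction
    embedded = (word >> 8) & 0xFFFFFF  # assumed: upper 24 bits carry a small operand
    return opcode, embedded

def shift_left(a, count):
    """Barrel-shift a 32-bit value left by `count` bits in one step."""
    return (a << count) & MASK32

word = (22 << 8) | ASL_N               # one instruction word: shift left 22 bits
op, count = decode(word)
print(hex(op), count, hex(shift_left(1, count)))
```

One such word does the work of 22 consecutive single-bit ASL instructions.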

I've dealt with I/O that was 4-bit and even 1-bit with an 8-bit 6502 and there were of course no problems with alignment or anything else. There are different ways to handle the rare sign-extension need that Ed mentions, but even though most of my 6502 programming is in Forth which routinely handles 16-bit cells, I don't remember ever having to extend the sign of an 8-bit number to a 16-bit one. 16- to 32-bit yes, but not very often. The Forth word S>D (single to double) does that.

When I program embedded controllers (and I've brought quite a few to market), 8-bit is really enough. And to run 6502 code, I will continue to use a 6502, so I don't need the bigger processor to have an emulation mode. Many on this forum don't get heavily enough into this kind of work to justify going to 32-bit, and that's ok; but it would open up a lot of possibilities for my workbench computer that are more math-intensive and keep larger amounts of data while keeping the simplicity of the 6502.

OwenS · Post by **OwenS** » Sat Jul 18, 2009 1:03 am

It must be just me who feels that wasting 3/8ths of your code space is, well, wasteful.

The other thing is not supporting byte accesses is going to make porting software a nightmare. Much assumes that you can address stuff bytewise.

Even if leaving your instructions 32-bit wide, I see no reason for not implementing bytewise access. It will vastly simplify any string handling. And it's not that complex; a small amount of logic in the memory unit for zero and sign extending values, and for telling the bus how big an access you're doing. That's it. And with 24 otherwise wasted opcode bits, why not do it?! You can leave the registers 32-bit, and just let programs ignore the upper portions.
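The zero- and sign-extension logic can be sketched in a few lines of Python. This assumes byte-addressed, little-endian memory (as on the 6502); the function name `load` and its parameters are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def load(mem: bytes, addr: int, size: int, signed: bool) -> int:
    """Load `size` bytes (1, 2 or 4) at `addr` and extend the value to 32 bits."""
    raw = mem[addr:addr + size]
    value = int.from_bytes(raw, "little", signed=signed)
    return value & MASK32   # a negative value becomes its 32-bit two's-complement pattern

mem = bytes([0x80, 0x34, 0x12])
print(hex(load(mem, 0, 1, signed=False)))  # zero-extended byte
print(hex(load(mem, 0, 1, signed=True)))   # sign-extended byte
print(hex(load(mem, 1, 2, signed=False)))  # 16-bit load
```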

moonshine · Post by **moonshine** » Mon Jul 20, 2009 3:42 am

I just registered, but I have some ideas as well, many of them from some RISC processors. I am not proposing a RISC processor however because that's not what a 65xxx is anyway. It is possible that some think my ideas don't retain the spirit of 65xxx processors but I think they definitely do, while improving it at the same time. This is not a finished plan and probably contains some errors or stupidities. I don't propose anything like deep pipelining, out-of-order execution or large caches that make desktop processors so complicated nowadays, as this processor should fit in an FPGA. Constructive criticism is accepted and welcome.

Here comes:

- First of all, processor would be 32-bit internally as far as registers are concerned, but data bus would be 16 bits for ease of implementation (and fewer pins would be required as well). I will explain "ease of implementation" shortly.

- There would be two accumulators, A and B, like in 6809. There would be four index registers, X, Y, Z and SP. Everything 32-bit, of course. I believe this would improve support for high-level languages and also make machine language programming easier. Address space should be flat and any banking is to be avoided.

- All instructions would be either 16-bit or 32-bit (maybe 48 bits in some cases) in length, with a 16-bit opcode and possibly a 16-bit (or 32-bit) data word. This combined with a 16-bit data bus would ensure there wouldn't be unaligned instructions, ever. Large constants could be put in a table or loaded with several instructions (LDA.W #HIWORD; ASLA #16; ORA.W #LOWORD) and index registers could be used for address calculations, but to ease machine language programming some 32-bit absolute/immediate instructions could be provided. None of those instructions should be needed in principle though.

- There should be 8-bit, 16-bit and 32-bit load/store instructions with separate opcodes. This also applies to instructions between an accumulator and a memory location.

- There would be ADQ (ADd Quick) that would replace INC/DEC and fit in 16 bits, like in 680x0. The range could be +/-8, covering all common indexing cases and being more powerful than double INC/DEC as proposed earlier.

- As there are 65536 opcodes available and maybe about 256 needed, there could be an optional conditional execution for every instruction, like in ARM. This could be used to eliminate branches over few instructions and would be zero-cost. There should still be enough space to encode shift counts etc. into a 16-bit opcode.

- Fast divide/multiply instructions would be provided if feasible. Floating point wouldn't be supported, except maybe by a separate co-processor.

- There could be a few IRQ vectors/lines as well, to make interrupts faster.
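The large-constant sequence from the list above (LDA.W #HIWORD; ASLA #16; ORA.W #LOWORD) can be traced in Python. This is only a sketch of what the three proposed instructions would compute; `build_constant` is a made-up name.

```python
MASK32 = 0xFFFFFFFF

def build_constant(hi_word: int, lo_word: int) -> int:
    """Assemble a 32-bit constant from two 16-bit immediates."""
    a = hi_word & 0xFFFF        # LDA.W #HIWORD : load the upper half into A
    a = (a << 16) & MASK32      # ASLA #16      : move it into the top 16 bits
    a |= lo_word & 0xFFFF       # ORA.W #LOWORD : merge in the lower half
    return a

print(hex(build_constant(0x1234, 0x5678)))
```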

Putting on the asbestos suit..

GARTHWILSON · Post by **GARTHWILSON** » Mon Jul 20, 2009 7:15 am

Welcome, moonshine. I won't be able to post for a few days. I'm not trying to ignore you.

ChuckT · Post by **ChuckT** » Tue Jul 21, 2009 1:42 pm

OwenS wrote:

If you want to avoid that kind of delay, you need a prefetch buffer or such; of course, now we're heading into a pipelined architecture. Admittedly, pipelining isn't that much more complex than not pipelining. The main problem is interlocks to ensure an instruction doesn't enter the pipeline before one it's dependent upon finishes, and duplication of functionality in different stages (The second is more an issue with CISC style architectures - the RISC one I'm designing doesn't really have this issue).

Why not have a wait / delayed jump instruction or subroutine that would wait x amount of clock cycles until the pipeline finishes?

OwenS · Post by **OwenS** » Thu Jul 23, 2009 12:07 am

A prefetch buffer is nothing to do with branches; it can be considered a dumb cache: It simply caches bytes after the currently executing instruction. It doesn't speculatively cross jumps, etc.

As for waiting for the pipeline to flush: In my arch, not needed. PC can be loaded by the decoder (and is because it's faster). Branches will probably cost extra clock cycles but that's because the prefetch buffer is empty, not because of pipeline flushes (The lack of an instruction to execute will cause the microcode unit to spin through NOPs).

I'm avoiding branch delay slots. They're nonintuitive, and evidence from processors which have them shows they've proven, in the long run, to be a mistake.

GARTHWILSON · Post by **GARTHWILSON** » Fri Jul 24, 2009 6:28 am

Quote:

It must be just me who feels that wasting 3/8ths of your code space is, well, wasteful.

At this point my only memory-conservation concerns are that I only have 32KB of ROM on my workbench computer and 16KB of RAM, and although it has always been plenty for code, I would like a thousand times as much or more for data, with no bank or page boundaries. String memory won't be efficient at (usually) one 32-bit byte for each character; but since this is not for replacing my desktop PC, it's highly unlikely to be holding documents with megabytes of text. The large quantities of data I want will mostly be numerical, not text.

Doing things in assembly language is rather limiting though, and to get beyond those limits, you use higher-level languages, which is what I have in mind for the 65Org32. Forth is extremely memory-efficient-- far more so than any other language I know of-- and being able to handle the entire 32 bits at once sure cuts down the number of instructions needed to do a job where 16 isn't really enough, either because you're constantly needing access to memory outside a 64K boundary, or because you're constantly needing a data number range exceeding 16 bits, or both.

Quote:

You can leave the registers 32-bit, and just let programs ignore the upper portions.

Yes, there could be an extension on the op code that tells it to AND-out the high 24 bits that you fetch, instead of following the fetch with AND #$000000FF. The person designing the processor's logic would have to tell us if the performance hit is less than having to do the ANDing here and there in the code. I thought of just having pull-down resistors on those data lines so they automatically get zeros if a device doesn't put data on them, but decided against it for reasons of speed and using the bus capacitance to prolong hold times.

Here's an example code comparison between '02 and how I envision the 65Org32. It's for a 32-bit LOOP word in indirect-threaded Forth which increments the index, compares it to the limit, and branches back to the beginning of the loop if you're not done looping.

First in 65Org32:

Code:

CODE loop  ( -- )     ; loop (the internal) is compiled by LOOP (the immediate compile-only word).
        INC   1,S     ; Increment the loop index,
        CMP   2,S     ; and compare it to the loop limit.
        BEQ   quitlp  ; Branch if index has reached the limit.
contlp: LDA   (IP)    ; If exit condx not met, put the addr shown by the cell
        STA   IP      ; after "loop" and set the program counter back to the
        JMP   NEXT    ; top of the loop.

quitlp: PLA           ; If done looping, pull the loop index
        PLA           ; and limit off the stack,
        PLA           ; then pull the exit address off the stack.
        STA   IP      ; This needs to be done anyway, and faster than incrementing
        JMP   NEXT    ; IP past the branch address.

Now in 6502. LOOP is usually done in only 16-bit in 6502 Forth, but I wrote this because sometimes 16-bit wasn't enough. I called this one "2LOOP" for "double-precision LOOP", which is the precision we get with the few instructions above for the 65Org32. Although this handles 32-bit index and limit, it still can't run outside the 6502's 16-bit address space like the one above can. If it somehow could, it would be even longer. It would also be slightly longer for NMOS 6502 than for CMOS.

Code:

CODE 2loop  ( -- )       ; 2loop is compiled by 2LOOP.
        STX   XSAVE      ; Save the data stack pointer so we can use X
        TSX              ; for return-stack-relative addressing.

        INC   $101,X     ; Incr lowest byte of index on return stack, and
        BNE   1$         ; skip incrementing higher bytes if it didn't roll over.
        INC   $102,X     ; If it did roll over, increment next byte,
        BNE   1$         ; etc..
        INC   $103,X
        BNE   1$
        INC   $104,X

 1$:    LDA   $101,X     ; Compare incremented index to limit, starting with low byte.
        CMP   $105,X
        BNE   contloop   ; If any differences, you're not done, so branch to continue loop.
        LDA   $102,X
        CMP   $106,X
        BNE   contloop
        LDA   $103,X
        CMP   $107,X
        BNE   contloop
        LDA   $104,X
        CMP   $108,X
        BEQ   quitloop   ; If it all matches, branch to finish the loop.

contloop:
        LDA     (IP)     ; To go around for another loop, put the contents of
        PHA              ; where IP points and put them into IP for NEXT to
           LDY  #1       ; go to the right place, ie, to top of loop.
           LDA  (IP),Y
           STA  IP+1
        PLA
        STA     IP
        LDX     XSAVE    ; Restore the data stack pointer
        JMP     NEXT     ; and end.

quitloop:
        LDA   $109,X     ; Get the exit address from the return stack and
        STA   IP         ; put it in IP.  (The next cell after 2loop is the
        LDA   $10A,X     ; addr of the top of the loop, which is not what we
        STA   IP+1       ; need right now.)
        TXA              ; Instead of using a big pile of PLAs, put the stack
        CLC              ; pointer in A to add to it to effectively remove
        ADC   #$0A       ; the 32-bit loop limit and index and the 16-bit
        TAX              ; end addr from the return stack.
        TXS
        LDX   XSAVE      ; Then restore the data stack pointer,
        JMP   NEXT       ; and end.

Without the headers, the 65Org32 version is 19 32-bit bytes (or fewer if some operands get merged with op codes) and the 6502 version is 91 normal (8-bit) bytes. That's 608 bits for the 65Org32 versus 728 bits for the 6502, so the processor with the 32-bit data bus actually made more-efficient use of memory. I didn't count the cycles, but clearly the 65Org32 will be faster.
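The size comparison is simple arithmetic, spelled out here in Python using the counts given above (19 32-bit bytes versus 91 8-bit bytes); `code_bits` is just an illustrative helper name.

```python
def code_bits(units: int, unit_width: int) -> int:
    """Total memory bits used by `units` storage units of `unit_width` bits each."""
    return units * unit_width

bits_65org32 = code_bits(19, 32)   # 65Org32 version: 19 32-bit bytes
bits_6502 = code_bits(91, 8)       # 6502 version: 91 ordinary 8-bit bytes

print(bits_65org32, bits_6502, bits_6502 - bits_65org32)
```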

If you did it in assembly instead of Forth, the 65Org32 code becomes, in its most basic form,

Code:

        INX
        CPX   limit
        BNE   loop_top

This is half the length of the portion of the first listing that usually gets executed, although it might require a variable for "limit" (if it's not a constant) and it also doesn't cover you for using X for anything else like nesting or recursion. Doing it in 6502 for 32-bit gets dirty and long again.
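To show why the 8-bit version gets long, here is a Python model of the byte-at-a-time work in the 2loop listing: an increment that ripples into the next byte only when the current one rolls over to zero, followed by a byte-by-byte equality test. The function names are made up, and the lists stand in for the four index and four limit bytes on the return stack, low byte first.

```python
def inc32_bytes(b):
    """Increment a 4-byte little-endian value in place, like the INC/BNE chain."""
    for i in range(4):
        b[i] = (b[i] + 1) & 0xFF
        if b[i] != 0:        # BNE: no rollover, so the higher bytes are untouched
            break
    return b

def equal32_bytes(index, limit):
    """Compare byte by byte, like the LDA/CMP/BNE chain; True means done looping."""
    for i in range(4):
        if index[i] != limit[i]:
            return False     # any difference: continue the loop
    return True

index = [0xFF, 0xFF, 0x00, 0x00]   # $0000FFFF
limit = [0x00, 0x00, 0x01, 0x00]   # $00010000
inc32_bytes(index)
print(index, equal32_bytes(index, limit))
```

With a 32-bit byte, each of these two functions collapses to a single instruction.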

GARTHWILSON · Post by **GARTHWILSON** » Fri Jul 24, 2009 6:31 am

moonshine, thanks for joining us, and welcome.

Quote:

- First of all, processor would be 32-bit internally as far as registers are concerned, but data bus would be 16 bits for ease of implementation (and fewer pins would be required as well).

If you're still advocating a 32-bit address bus (you haven't said), having a 32-bit data bus instead of 16-bit does not make as much difference percentagewise in the total pin count; but the 32-bit data bus makes some jobs much faster with the same speed of memory. Further, the internal instruction-decoding logic will be simpler and faster if memory accesses are all the same width, and that will allow a faster clock speed. The way to get there is to make the data bus the same width as the internal registers, so no fetch or store (including push or pull) ever has to go to the memory more than once. The 6502 has to go twice for indirect addresses and vectors. The 65816 has to go twice for a lot more things, and three times for a few things. That slows down the works. If you have a 32-bit data bus, the only thing requiring more than one memory access would be the possible 64-bit result of a 32x32-bit multiplication (and only if you need the high byte), and you would have high- and low-word registers, also used for a 64/32 division. I think this might be the only real place for a second accumulator. Having the 32-bit data bus might have its most important speed-up effect on interrupt latency and register-saving as in any kind of task-switching.
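The access counts can be checked with a little Python. This sketch assumes every transfer is made in whole bus-width pieces; `accesses` and `mul32` are illustrative names, and the high/low register pair is modelled as a plain tuple.

```python
MASK32 = 0xFFFFFFFF

def accesses(value_bits: int, bus_bits: int) -> int:
    """Memory accesses needed to move a value over a bus of the given width."""
    return -(-value_bits // bus_bits)   # ceiling division

def mul32(a: int, b: int):
    """32x32-bit unsigned multiply; returns the (high, low) 32-bit halves of the product."""
    product = (a & MASK32) * (b & MASK32)
    return product >> 32, product & MASK32

print(accesses(16, 8))    # 6502 fetching an indirect address or vector
print(accesses(24, 8))    # 65816 fetching a 24-bit address
print(accesses(32, 16))   # 32-bit register over a 16-bit data bus
print(accesses(32, 32))   # 32-bit register over a 32-bit data bus
print(accesses(64, 32))   # full 64-bit product, if the high half is wanted
print([hex(x) for x in mul32(0xFFFFFFFF, 0xFFFFFFFF)])
```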

Quote:

- All instructions would be either 16-bit or 32-bit (maybe 48 bits in some cases) in length, with a 16-bit opcode and possibly a 16-bit (or 32-bit) data word. This combined with a 16-bit data bus would ensure there wouldn't be unaligned instructions, ever.

In 16-bit Forth on the 6502, we refer to alignment as making sure cells always start at an even address. It's not absolutely necessary, but simplifies a few decompiling-related things. In assembly it's no issue; but what you're saying still allows things to start on odd addresses, because you have 1-, 2-, and 3-word instructions.

Quote:

Large constants could be put in a table or loaded with several instructions (LDA.W #HIWORD; ASLA #16; ORA.W #LOWORD)

This is something I want to avoid, since I intend to be exceeding the 16-bit limit 98% of the time in Forth.

Quote:

- There would be ADQ (ADd Quick) that would replace INC/DEC and fit in 16 bits, like in 680x0. The range could be +/-8, covering all common indexing cases and being more powerful than double INC/DEC as proposed earlier.

Good idea, but it seems unnecessary with a 32-bit address bus. On the '02, double increments and decrements are necessary all the time in higher-level languages to get from one two-byte address or cell to the next. With the data bus the same width as the address bus, a single does the job.

Your last few things have been mentioned in previous pages, and I agree. Several other things you're proposing make the instruction-decoding more complex and slower. The 6809 had some nice features, but I think they were part of what kept the 6809 from ever getting anywhere near the speed of the '02 and '816. The fastest 65c02's I've heard of were inside custom ICs and topped out over two hundred MHz. The only off-the-shelf individual 65c02's and 816's (ie, not in microcontrollers or other custom ICs) being made today that I know of can all run at bus speeds of at least 14MHz.

Quote:

Putting on the asbestos suit..

I hope I don't come across as flaming, but I don't want the obligation to "be nice" to keep us from making our points. I don't want anyone else to back down either if they can see that I still didn't get their point.

BigEd · Post by **BigEd** » Fri Jul 24, 2009 9:08 am

Hi Garth
It looks like the 6809 begat the 68hc11 which begat the 68hc12. (Edit: hmm, it's a bit more complex than that.) It doesn't look to me like the architecture is particularly limiting the clock speed, more likely it's a marketing question as to why there is no 14MHz standalone version. So, I wouldn't think that's reason enough to avoid packing more power into each 32-bit opcode. It's best to avoid difficult-to-decode encodings, but with so many bits I doubt that's a concern.

More importantly, keeping things simple reduces the complexity of the project. I think that's a better motivation for deferring or rejecting ideas which would reduce code size and cycle count.

It's good to know what the defining characteristics of a project are. I think the 65Org32 is aimed at massively increasing data space but keeping a simple programming model. I don't know what the clock speed target might be, but 14MHz fits well with static RAM speeds: in that case the speed of the CPU is probably not the limiting factor.

With the free xilinx tools and the freely available 6502 designs out there, one could experiment with size and speed of implementation without committing any money. With the 6502 emulators out there, one could experiment with the programming model and the toolchain, and determine code density and cycle counts. (Bringing up a Forth for example, by hand-assembling or by porting an assembler.)

For both of those ideas, the closer the machine looks to a 6502 the easier the task. The 32-bit byte helps a lot. The pincount bothers me, but it shouldn't: 84 pins is just about enough for this, and 144 is plenty.

Cheers
Ed

Nightmaretony · Post by **Nightmaretony** » Fri Jul 24, 2009 2:24 pm

One helluva nice thing about the Xilinx package, as well as many others, is simulation. Perhaps we can begin by simply taking a 65C02 core and upping the data and address widths first, THEN adding in the improvements and upgrades we want?

moonshine · Post by **moonshine** » Sat Jul 25, 2009 8:15 am

GARTHWILSON wrote:


Quote:

Putting on the asbestos suit..

I hope I don't come across as flaming, but I don't want the obligation to "be nice" to keep us from making our points. I don't want anyone else to back down either if they can see that I still didn't get their point.

Thanks for your insight. I was just trying to be funny by referring to the asbestos suit.

I don't have much hardware design experience, so I was already pretty sure some of my ideas are inconvenient to implement.

BigEd · Post by **BigEd** » Sat Jul 25, 2009 9:25 am

I've just realised that 65Org32 differs from 6502 crucially in that addresses are just one 32-bit byte.

So the JMP/JSR/RTS/RTI family only push/pull one address byte. Same for the vectors at top of memory.

Similarly, the zero-page indexes change their nature: a single zero-page location is enough to hold an address.

This is all good from a cycle count perspective, but changes the implementation quite a bit.
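A small Python sketch of the difference on the stack, assuming the high part of an address is pushed first as on the 6502. It models only how many stack slots an address occupies, not the exact return-address value JSR stores; `push_address` is a made-up name.

```python
def push_address(stack, addr, addr_bits, unit_bits):
    """Push `addr` as storage units of `unit_bits` each, high part first.

    Returns the number of pushes (one memory access each).
    """
    units = addr_bits // unit_bits
    unit_mask = (1 << unit_bits) - 1
    for i in reversed(range(units)):
        stack.append((addr >> (i * unit_bits)) & unit_mask)
    return units

stack_6502 = []
print(push_address(stack_6502, 0x1234, 16, 8), [hex(v) for v in stack_6502])

stack_65org32 = []
print(push_address(stack_65org32, 0x12345678, 32, 32), [hex(v) for v in stack_65org32])
```

The same counting applies to the vectors and to zero-page pointers: one location holds a whole address.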