6502.org • View topic - Improving the 6502, some ideas


All times are UTC

Improving the 6502, some ideas

Page 2 of 13

[ 186 posts ]


BigEd


Posted: Mon Jun 22, 2009 4:01 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

For VBR's original question - what small delta would you make to the 65C02 - I'd start with the '816's B accumulator, and then I'd take some other goodies from the '816. I'd prefer separate opcodes for 16-bit operations, if there were any. I don't much like the modes.

Relocatable code would be good, so I'd throw in BSR for a relative subroutine call, and I'd allow 16-bit offsets for that and for branches. Stack relative sounds good too. Where this fits in the opcode map I don't know: perhaps I'd allow myself prefix bytes.
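As a rough illustration of the relative call described above, here is a minimal Python sketch. The 3-byte encoding, the `bsr` helper and the choice to push the plain return address are assumptions made for illustration; they are not taken from the 65CE02 or any other real part.

```python
def branch_target(pc_after_operand: int, offset: int, bits: int = 16) -> int:
    """Target of a PC-relative branch. `offset` is the raw two's-complement operand."""
    sign = 1 << (bits - 1)
    signed = (offset ^ sign) - sign          # sign-extend the raw operand
    return (pc_after_operand + signed) & 0xFFFF


def bsr(pc: int, offset: int, stack: list) -> int:
    """Hypothetical 3-byte BSR at `pc`: push a return address, return the new PC."""
    next_pc = (pc + 3) & 0xFFFF              # opcode + 16-bit offset (assumed encoding)
    stack.append(next_pc)                    # simplified; a real JSR pushes next_pc - 1
    return branch_target(next_pc, offset)
```

Because the target is computed from the PC rather than stored as an absolute address, the same bytes work wherever the code is loaded.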

I'd probably want to sneak in an 8x8 multiply, if I could afford it.

Considering Garth's larger scale suggestions, I'd want to focus on what the implementation might be. If it's to fit a processor into a 40-pin or 44-pin programmable part, then memory will be external and there's not much room for wider address or data. If it's to embed inside a huge FPGA, memory might be internal and efficient use of it would be important. But pincount wouldn't matter.

If you're not thinking of some particular implementation, a paper sketch or an emulation has no limits to what you can throw in. Arguably one of the ARM's original weak spots for implementation was such a rich instruction set that the register file needs an immoderate number of ports. (In fact thinking of the ARM as a reinvented 32-bit 6502 isn't a bad model, if you look at the history.)

The 6502 should be just about implementable in a cheap part which can be put on a breadboard, so I'd go for that first and then add some minor tweaks.

I'd probably throw out some compatibility and try for fewer unused bus cycles, on the guess that memory access will limit performance. A wider data bus could improve performance but you'd need a careful approach for I/O devices. I'd like to consider a 16-bit data bus with byte enables, faithful to byte-addressable memory, but able to do twice as much per access in the best case. (But have I got the extra pins?) A single RAM chip is attractive, otherwise why not a 32-bit data bus, and go for an 84-pin part.
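To make the "twice as much per access in the best case" point concrete, here is a small Python sketch of a 16-bit bus with two byte enables. The function name and the tuple layout are hypothetical; it only counts bus cycles and shows which byte lanes would be enabled.

```python
def bus_cycles_16(addr: int, nbytes: int):
    """Return one (word_address, low_enable, high_enable) tuple per bus cycle."""
    cycles = []
    end = addr + nbytes
    while addr < end:
        word = addr & ~1                       # word-aligned address on the bus
        low = (addr & 1) == 0                  # even byte lane used?
        high = (not low) or (addr + 1 < end)   # odd byte lane used?
        cycles.append((word, low, high))
        addr = word + 2
    return cycles
```

An aligned 16-bit access takes one cycle with both enables active, a misaligned one takes two cycles, and a single-byte I/O access enables only one lane, which keeps memory byte-addressable.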

Overall, as a matter of personal preference, I'm thinking small scale single-CPLD incremental improvements, rather than a 32-bit re-architecting.


VBR


Posted: Mon Jun 22, 2009 5:22 pm

Joined: Tue May 29, 2007 1:29 pm
Posts: 25

BigEd wrote:

A "B" register isn't a bad idea. Could be used for temporary storage and as an implicit source for new arithmetic instructions (A = A op B)

BigEd wrote:

Relocatable code would be good, so I'd throw in BSR for a relative subroutine call, and I'd allow 16-bit offsets for that and for branches.

Yeah, the 65CE02 had that.

BigEd wrote:

Stack relative sounds good too. Where this fits in the opcode map I don't know: perhaps I'd allow myself prefix bytes.

That's why I like the idea of a stack-relative mode; it conserves opcode space. If you are using a locals-oriented programming style (or a C compiler), you would set the mode and leave it set. If using a globals-oriented style, you would use it rarely, if at all.


BigEd


Posted: Mon Jun 22, 2009 6:53 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Interesting - I must have seen the 65CE02 mentioned here but I never looked into it. I'll have a read of the datasheet - it might be that it's quite close to my ideal upgrade (for a home built re-implementation pipedream!)

Some time ago I see Garth said "There were not very many 65CE02's made, and the '816 was a much better upgrade to the 65c02 than the CE02 was, so I'm glad to have that." I'm in broad agreement in that I can buy an '816, but I wouldn't want to try to implement it.

The B register is handy though: XAB/XAB feels better than PHA/PLA. But saving everything in an interrupt handler is painful in the 816 - I'd be tempted to propose a push everything / pull everything. It's a problem when adding registers or making them wider, with only an 8-bit data bus.


GARTHWILSON


Posted: Mon Jun 22, 2009 7:46 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California

Quote:

I'd prefer separate opcodes for 16-bit operations, if there were any. I don't much like the modes.

The only reason I can think of offhand that the '816 can't be in 16-bit mode for both A and index registers full-time is that reading or writing them will take two consecutive addresses since the data bus is only 8-bit, which sometimes will give unwanted results. But along the same lines:

Quote:

A wider data bus could improve performance but you'd need a careful approach for I/O devices.

On the 6502, I've addressed 4-bit I/O devices too (actually an RTC) and even 1-bit (actually just a flip-flop); and addressing these with an 8-bit bus was no problem. Having a 6522 on a 32-bit bus simply means 24 of the bus bits don't get connected, as I mentioned before.

Quote:

Relocatable code would be good, so I'd . . .

The '816 isn't perfect for relocatable code, but it is orders of magnitude better for it than the 6502 is. On the 6502, if a program gets moved after it's already loaded and running (which probably won't happen without multitasking), JSR (and BRK, if you even want to count that) is the only way to get the PC (as it puts it on the stack) to add or subtract an offset for long-relative branching or data access. I wrote some routines to try relocatable code, and while the program part of it wasn't so bad (because you don't branch or jump that often), the big killer was the constant access to variables or I/O-- and that definitely killed the idea. If the PC and all other registers are all the same size (my preference being 32-bit) and in a set of registers as I proposed earlier, the PC becomes just as accessible as the rest.
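The JSR trick mentioned above can be sketched in Python as plain arithmetic. It relies on the fact that a 6502 JSR pushes the address of its own last byte (opcode address plus 2). The helper names are hypothetical, and the sketch assumes the JSR targets a routine at a known fixed address, since JSR itself takes an absolute operand.

```python
def jsr_pushed_address(jsr_addr: int) -> int:
    """A 6502 JSR pushes the address of its own last byte (opcode address + 2)."""
    return (jsr_addr + 2) & 0xFFFF


def load_delta(pushed: int, assembled_jsr_addr: int) -> int:
    """How far the code sits from where it was assembled, modulo 64K."""
    return (pushed - jsr_pushed_address(assembled_jsr_addr)) & 0xFFFF


def relocate(assembled_addr: int, delta: int) -> int:
    """Turn an assemble-time address into the address it has after the move."""
    return (assembled_addr + delta) & 0xFFFF
```

Every variable or I/O access in moved code would need the equivalent of `relocate` applied at run time, which is the overhead the post describes as the big killer.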

Quote:

Stack relative sounds good too. Where this fits in the opcode map I don't know

The '816 puts all the stack-relative op codes in column 3.

Quote:

Considering Garth's larger scale suggestions, I'd want to focus on what the implementation might be. If it's to fit a processor into a 40-pin or 44-pin programmable part

I was thinking more like an 84-pin, in order to get 32 address bits, 32 data bits, and the other signals. Oh, I see you got to that:

Quote:

A single RAM chip is attractive, otherwise why not a 32-bit data bus, and go for an 84-pin part.

RAM chips and modules also come in 2- and 4-byte-wide data buses.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


GARTHWILSON


Posted: Mon Jun 22, 2009 7:53 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California

Quote:

I'm in broad agreement in that I can buy an '816, but I wouldn't want to try to implement it.

...But saving everything in an interrupt handler is painful in the 816

If you really only want an upgrade to the '02 and only plan to use the first 64K of memory space, keep in mind that you don't have to latch, decode, or use the high address byte on the '816 (i.e., you can totally ignore that it even exists), and you don't ever have to change or save the bank registers. So in that situation the '816 has none of the drawbacks that people think it does. You could even leave A, X, and Y in 8-bit mode all the time and still take advantage of a lot of new instructions and addressing modes, movable zero page, much deeper stack space, etc. Hmmm... I suppose you could even use the bank registers as extra registers for temporary storage if your hardware ignores A16-A23.

The outer interrupt handler in my '816 Forth consists of:

Code:

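setirq: ACCUM_16        ; macro for REP #00100000B: accumulator into 16-bit mode
        PHA             ; save the 16-bit accumulator
        ; . . . (five instructions here, the meat of the routine)
        PLA             ; restore the 16-bit accumulator
        RTI             ; return from interrupt (restores P, so the previous mode comes back)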

Really quite painless. ACCUM_16 is just a one-line macro that assembles REP #00100000B, but it makes it much clearer to humans that the accumulator is being put into 16-bit mode.
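For readers unfamiliar with REP, here is a small Python model of what the macro's instruction does to the status register in native mode. REP clears the status bits selected by its operand and SEP sets them; bit 5 (M) selects the accumulator width and bit 4 (X) the index width. The function names are just for illustration.

```python
M_FLAG = 0b00100000  # 65816 status bit 5: 1 = 8-bit accumulator/memory, 0 = 16-bit
X_FLAG = 0b00010000  # 65816 status bit 4: 1 = 8-bit index registers, 0 = 16-bit


def rep(p: int, mask: int) -> int:
    """REP #mask: reset (clear) the selected status bits."""
    return p & ~mask & 0xFF


def sep(p: int, mask: int) -> int:
    """SEP #mask: set the selected status bits."""
    return (p | mask) & 0xFF


def accumulator_is_16_bit(p: int) -> bool:
    return (p & M_FLAG) == 0
```

So `rep(p, M_FLAG)` corresponds to ACCUM_16, and RTI restoring P is what returns the interrupted code to whatever width it was using.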


GARTHWILSON


Posted: Tue Jun 23, 2009 12:35 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California

I didn't post the following earlier because I didn't want to monopolize the topic, but I should probably reply to some earlier comments before they get too far back.

And VBR, I hope my hitch-hiking on your "improving the 6502" with the 32-bit 6502 idea doesn't come across as hijacking the thread-- it's just that you brought up something I was anxious to talk about, or at least it was close. I can start a new 32-bit 6502 topic if you prefer.

I had hoped that some of those who know more than I do about programmable logic and processor design would join in. I wonder if they're just not checking the forum that often, or are just too busy to post so far.

Quote:

Regarding "memory is super cheap now"...when it was a struggle to fit a CPU on a chip, it might have made sense to waste memory in exchange for a simpler CPU. But when the entire system, including memory, is on a single chip, it makes no sense to waste memory, since memory is the biggest cost.

I'm not sure what you're saying. Today PC microprocessors commonly have a couple of megabytes of onboard cache. Microcontrollers OTOH have very little RAM-- often well under 100 bytes, because of architecture limitations and intended uses, not so much the cost, as witnessed by a quick look in the Digi-Key catalog which shows a PIC32 microcontroller with 512KB (or should that be 512K 32-bit words?) of onboard program memory and 32KB of RAM (and 85 I/O pins) going for about $6 in qty 100. I do wish we had a wide array of 65-based microcontrollers with onboard user-programmable (E)EPROM or flash program memory; but alas, no such luck right now. I'm sure making our own processor in programmable logic will require forgoing all onboard memory except for processor registers.

But as for outboard RAM, when I made my first 6502 computer (1984-'85) an 8Kx8 SRAM was $40 at Jameco. Today, the same number of dollars (ie, without even adjusting for inflation) will buy close to a million times as much memory, and it's much faster than what I got back then. Granted, that's DDR DRAM which can't be interfaced the same way, and we will have to decide if or how to handle that. We don't want to be locked in to something that's here today and gone tomorrow before the new processor is operational. I suppose external DRAM management is a possibility, but that would probably add a lot of unnecessary complication for the user; and the whole point here is to get a lot more computing power without abandoning 6502 simplicity.

The most wasteful of my propositions so far is using a 32-bit "byte" for each text character. That would just mean the entire Bible, in uncompressed text, would take about 20MB (ie, 5MW) instead of 5MB. That's still not much for today's memory. Without compression (ie, in raw mode), you can't fit even one picture from a 10MP digital camera in that amount of memory.
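The arithmetic behind that comparison, as a quick Python check. The 5MB text size is the figure from the post; the 3 bytes per pixel for an uncompressed image is an assumption.

```python
def text_size_bytes(characters: int, bytes_per_char: int) -> int:
    """Storage for uncompressed text at a given character width."""
    return characters * bytes_per_char


def raw_image_bytes(megapixels: int, bytes_per_pixel: int = 3) -> int:
    """Storage for an uncompressed image (3 bytes/pixel is an assumed RGB format)."""
    return megapixels * 1_000_000 * bytes_per_pixel


text_8bit = text_size_bytes(5_000_000, 1)    # about 5 MB with one byte per character
text_32bit = text_size_bytes(5_000_000, 4)   # about 20 MB with a 32-bit "byte"
photo = raw_image_bytes(10)                  # one raw 10MP frame under the assumption above
```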

In spite of how critical I used to be of memory waste in PC bloatware, the greater need for memory today seems to be for the data. My experience writing the '816 Forth kernel pleasantly surprised me with the dramatic reduction in the number of assembly-language instructions required to write the primitives that handle the mostly 16-bit and occasionally 32-bit cells; so I expect that when implementing higher-level languages on this processor, the program-memory penalty will be minor, or even insignificant, because the number of instructions required to do the job will be small compared to doing the same thing with the 6502.

Quote:

For a name, 65Org32? or 65Virtual32? (brat here this morning)

It does seem appropriate to give 6502.org the credit and ownership.

Quote:

6502 would be emulatable by programming, so you get it going there, not a lost feature at all.

Perhaps by something like a translating assembler? or even an interpreter? That should be easy enough, although timings (cycle counts) won't be the same.

Quote:

While we know and love our 6502 registers, what would be the advantage in a register bank in RAM similar to the Harvard series by setting a block of RAM as a register set? Think about it like setting up a stack, but having it possible to do register manipulation on it. For example, a ROL (register#, repetition) with the register# calculated from the previously set stack location plus the offset?

I mentioned the transition to this kind of register set possibly being rather transparent because of all the 65Org32 registers being 32-bit and being able to handle any value. Consider (and this is a list of examples, definitely not complete):

The program-bank and data-bank registers simply become offsets, instead of defining banks that are limited to 64K boundaries like the '816 does.
The program counter is another register that's as readable as any other, and writing to it constitutes a jump.
X & Y index registers also are 32-bit, allowing indexing into the entire memory map.
The direct-page register, if still used, is 32-bit, meaning it would just be another offset you could optionally use if you don't want absolute addressing or the program-bank or data-bank offsets. Essentially it works the same way.
The stack pointer is also 32-bit, allowing stacks to reside near the programs they service. In stack-relative addressing, it works like an offset, to which you add an index.

I started to see that with so much similarity forming between the various registers, now we could add other ones (possibly for a total of 16) to use for additional stacks, which also takes care of the NEXT instruction (that's the name of it) in indirect-threaded Forth and which I believe is basically the same as is used in some other high-level languages. For most things, certain registers would still be used all the time for the same thing; for example, A, B, and C (like the 816's A, B, and C accumulator registers) could be registers 00, 01, and 02, X and Y could be 03 and 04, S could be 05, P could be 06, PC could be 07, etc.; and although they could be addressed by register like LD R0,16(R03), you'd normally say LDA 16,X to get the same thing; but now you have more flexibility. I see it as expanding, not replacing, 6502 operation and capabilities.
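A toy Python model of that register set may help picture it. The numbering (A = 00, B = 01, C = 02, X = 03, Y = 04, S = 05, P = 06, PC = 07) follows the post; the class name, the `ld` and `write` methods and the dictionary-backed memory are all hypothetical.

```python
ALIASES = {"A": 0, "B": 1, "C": 2, "X": 3, "Y": 4, "S": 5, "P": 6, "PC": 7}
MASK32 = 0xFFFFFFFF


class Org32:
    """Sixteen uniform 32-bit registers; the 6502 names are just aliases."""

    def __init__(self, memory):
        self.r = [0] * 16
        self.mem = memory                      # dict: address -> 32-bit word

    def reg(self, name_or_num):
        return ALIASES.get(name_or_num, name_or_num)

    def write(self, dest, value):
        # Writing register 7 (PC) is, in effect, a jump.
        self.r[self.reg(dest)] = value & MASK32

    def ld(self, dest, offset, base):
        """Generic form: LD Rdest,offset(Rbase)."""
        addr = (self.r[self.reg(base)] + offset) & MASK32
        self.r[self.reg(dest)] = self.mem.get(addr, 0) & MASK32

    def lda_indexed_x(self, offset):
        """LDA offset,X is the same operation with A and X filled in."""
        self.ld("A", offset, "X")
```

The familiar mnemonic and the register-numbered form end up as two spellings of one operation, which is the sense in which the scheme expands rather than replaces 6502 behaviour.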

other examples:

On the 6502 we can test and branch on the various bits in P. Although the same things that normally set or clear the various status bits will still operate on the same register, now we can also just as easily test and branch on bits of any register.
Want an additional index register? You've got it.
Did you ever wish you could LDA (A), (ie, an indirect without first storing A in memory), or do LDA (A),X? Now you can.
Want additional stacks without losing your X or S pointers? No problem. Use uncommitted registers. (In addition to the return stack, I'll go for a data stack, and maybe also a complex-number stack, floating-point stack, and string stack.)

I don't think it really makes the instruction decoding much more complicated, although I haven't tried to work out details of how it would work. If I'm not understanding the register set correctly I hope Samuel or someone will jump in and clarify or fix, as appropriate.


VBR


Posted: Tue Jun 23, 2009 12:47 am

Joined: Tue May 29, 2007 1:29 pm
Posts: 25

BigEd wrote:

The B register is handy though: XAB/XAB feels better than PHA/PLA.

I was thinking it might be good if LDA toggled between A and B. As in:

lda #1 ; A = 1, B = ?
lda #2 ; A = 2, B = 1
lda #3 ; A = 3, B = 2

The advantage is that you don't need an LDB instruction.


BigEd


Posted: Tue Jun 23, 2009 7:23 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

An on-chip register stack? See the transputer and Forth Chips

One of the implementation points to look out for is the complexity of the processor handling interrupts. It looks like the 65CE02 doesn't take interrupts between single-cycle instructions. That'll be a consequence of the more aggressive pipeline, I think.

I think the 6502 and perhaps the 65816 are both architected around the time cost of an 8-bit addition: hence the extra cycle for crossing page boundaries. These days we can probably do a 32-bit addition in the time it takes for one memory access, if we stick to small static-RAM memory. DRAM modules are another question: I think the memory controller alone could quickly become more complex than an original 6502.
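The page-crossing penalty can be shown in a few lines of Python. The 6502 adds the index to the low address byte first; a carry out of that 8-bit add means the high byte must be fixed up, costing one more cycle. The cycle counts used are those of LDA absolute,X (4, or 5 when a page is crossed); the function names are hypothetical.

```python
def indexed_address(base: int, index: int):
    """Return (effective_address, page_crossed) for absolute,X style addressing."""
    low_sum = (base & 0xFF) + (index & 0xFF)
    crossed = low_sum > 0xFF                   # carry out of the 8-bit adder
    return (base + (index & 0xFF)) & 0xFFFF, crossed


def lda_abs_x_cycles(base: int, index: int) -> int:
    """LDA abs,X takes 4 cycles, plus 1 if the index crosses a page boundary."""
    _, crossed = indexed_address(base, index)
    return 4 + (1 if crossed else 0)
```

With a full-width adder that completes within one memory access, the `crossed` case would no longer need its own cycle.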

Edit: spello

Last edited by BigEd on Sat Apr 16, 2011 4:57 pm, edited 1 time in total.


ElEctric_EyE


Posted: Tue Jun 23, 2009 10:40 am

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA

I would like to see more speed. Even if no new instructions or add-ons. Just RAM/EEprom with the CPU on a single die, and 100+ MHz.

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502


kc5tja


Posted: Tue Jun 23, 2009 2:43 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

VBR wrote:

BigEd wrote:

The B register is handy though: XAB/XAB feels better than PHA/PLA.

I was thinking it might be good if LDA toggled between A and B. As in:

lda #1 ; A = 1, B = ?
lda #2 ; A = 2, B = 1
lda #3 ; A = 3, B = 2

The advantage is that you don't need an LDB instruction.

Going on the issue of "toggling",

How do you know which accumulator LDA affects? What if some code branches to the above code, and the "phase" is reversed from expectation?

You always want CPU instructions to be as explicit as possible. RISC has taught us as much. Functional reconstitution of assembly language code, a method instrumental in figuring out just what the heck some inscrutable code actually does, depends heavily on it. Having an LDA and an LDB is overwhelmingly preferable to random guesswork.

Thankfully, what you illustrated above is not toggling, but rather is clearly a "data stack" in the Forth processor parlance. You have, simply, a 2-deep stack.

However, this also yields quite a problem -- of what value is B when it's always being overwritten by the implicit-DUP with each LDA? I can think of numerous, and very frequently executed, code sequences where you want B's value to remain constant as you load another value into A for processing.

In either case, I must shoot these ideas down as impractical for the 6502's architecture.

If you want a genuine stack architecture, you just plain have to design the CPU, from the get-go, to support stack operations.


Nightmaretony


Posted: Tue Jun 23, 2009 2:53 pm

Joined: Fri Jun 27, 2003 8:12 am
Posts: 618
Location: Meadowbrook

Garth: perhaps a new forum for the 65Org32 on here? That way, the various subjects can have their own threads, such as basic architecture, real world interfacing, opcodes, VHDL design, etc.

Yup on the emulation of the 6502 though it would kill cycle counts UNLESS the emulator is designed to throw in delay loops from a table for the various instructions, to try and emulate the timing. I am not sure how MAME handles the 6502 cycling, but they have a C core that seems to work out fairly well for arcade graphics.
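The table-driven timing idea could look something like this in Python. The four cycle counts are the standard 6502 figures for those opcodes; the function name and the 1 MHz default clock are assumptions, and this is not a description of how MAME does it.

```python
# Base cycle counts for a few 6502 opcodes.
CYCLES = {
    0xEA: 2,  # NOP
    0xA9: 2,  # LDA #imm
    0x4C: 3,  # JMP abs
    0x8D: 4,  # STA abs
}


def emulated_time_ns(opcodes, clock_hz: int = 1_000_000):
    """Return (total_cycles, nanoseconds the original CPU would have taken)."""
    total = sum(CYCLES[op] for op in opcodes)
    return total, total * 1_000_000_000 // clock_hz
```

An emulator on a faster host would execute the instructions, then wait until at least that much real time has passed before continuing, so that timing loops in the original code still behave.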

Memory controller: maybe a small controller circuit portion that can be updated when required?

Also on the data, a friend used commands in his game design called PackBytes and UnpackBytes. Perhaps an assembler macro to pack four 8-bit bytes in a row, that kind of deal? This way, you do not have to encode it as a new instruction or bitwise mode. There is a definite simplicity in keeping it all in 32-bit-wide mode.
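A minimal Python sketch of what such a pack/unpack pair might do. The names echo the ones mentioned above, but the byte order (first byte in the least significant position) is an assumption.

```python
def pack_bytes(b0: int, b1: int, b2: int, b3: int) -> int:
    """Pack four 8-bit values into one 32-bit word, b0 least significant (assumed)."""
    return ((b0 & 0xFF)
            | (b1 & 0xFF) << 8
            | (b2 & 0xFF) << 16
            | (b3 & 0xFF) << 24)


def unpack_bytes(word: int):
    """Split a 32-bit word back into its four 8-bit values, same order as pack_bytes."""
    return tuple((word >> shift) & 0xFF for shift in (0, 8, 16, 24))
```

Text could then be stored four characters per word where density matters, while the processor itself only ever sees 32-bit quantities.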

PS: you are NOT monopolizing, you've got a lot of ideas towards this which will form a solid base for the design....

_________________
"My biggest dream in life? Building black plywood Habitrails"


kc5tja


Posted: Tue Jun 23, 2009 2:53 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

GARTHWILSON wrote:

Assuming you're reading from/writing to RAM, this is a non-issue.

The reason you don't always want to sit in 16-bit mode is simply the issue of carry-over. Manipulating a series of 8-bit bytes (not always strings, mind you) as such is not only faster, but completely eliminates the issue of stray carry over.

For example, working with graphics data is a common case. Imagine you're blitting a mouse pointer image a few pixels to the left or right. Since the mouse image is typically a 16x16 bitmap, you need a multi-precision shift. PROBLEM: nearly ALL graphics chips shift MSB first, not LSB, and are big-endian, since it yields the simplest possible logical mapping of bits to pixels on the screen. However, the CPU's shift/rotate instructions assume the opposite bit and byte ordering -- thus, using ROR/ROL/LSR/ASL on a word-sized piece of byte-sized bitmap data will result in garbage graphics!!
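The problem can be reproduced in Python with a single two-byte bitmap row, where bit 7 of the first byte is the leftmost pixel. The first function shifts the row one pixel to the right the way the display expects; the second mimics loading the same two bytes as a little-endian 16-bit word and shifting that. Both function names are hypothetical.

```python
def shift_right_correct(row: bytes) -> bytes:
    """Shift a bitmap row one pixel right, treating the bytes as big-endian."""
    value = int.from_bytes(row, "big") >> 1
    return value.to_bytes(len(row), "big")


def shift_right_little_endian_word(row: bytes) -> bytes:
    """What a 16-bit shift on a little-endian CPU does to the same bytes."""
    value = int.from_bytes(row, "little") >> 1
    return value.to_bytes(len(row), "little")
```

With only pixel 7 set (`bytes([0x01, 0x00])`), the correct shift moves it into the second byte, while the word-sized shift simply drops it. With only the last pixel set (`bytes([0x00, 0x01])`), the word-sized shift moves it to the far left of the row instead of off the right edge.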

Quote:

Having a 6522 on a 32-bit bus simply means 24 of the bus bits don't get connected, as I mentioned before.

Not necessarily. Assuming your 32-bit CPU has byte-addressing capability, that 32-bit bus will also have 4 byte-enable lines, which can also contribute to the address decoding logic. Thus, it's possible to interleave your 6522s so that all port As are adjacent in memory, all port Bs are adjacent, etc. If you're ganging multiple 6522s to emulate a "wider 16- or 32-bit I/O chip," this can be a useful technique.
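One way to picture the interleaving, as a Python sketch: the two lowest address bits pick the byte lane (and so which of four 6522s is enabled), and the next four bits drive the chip's register-select inputs. The address layout and the `decode` helper are assumptions; on a 6522, register 0 is port B and register 1 is port A.

```python
def decode(addr: int, base: int = 0x0000):
    """Map a byte address to (chip, register) for four interleaved 6522s."""
    offset = addr - base
    chip = offset & 3                 # byte lane / byte-enable line selects the 6522
    register = (offset >> 2) & 0xF    # drives RS0-RS3 on the selected chip
    return chip, register
```

Under this layout the four port B registers occupy base+0 to base+3 and the four port A registers base+4 to base+7, so one 32-bit access touches the same register on all four chips at once.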

Quote:

Even with multitasking, the image stays fixed in RAM. The Amiga demonstrated an excellent solution -- include address fixup meta-data in your binary loader format, so that the loader can repair broken operands. This allows you to assemble your code to a fixed virtual address ($00000000 in the case of AmigaOS; I believe $00000100 for Atari TOS), while the OS is capable of relocating the image upon load-time. Once loaded, of course, it's fixed in RAM.

However, through careful design, you can dynamically relocate code and statically known addresses for data and variables (but not data allocated at run-time!) by simply traversing the relocation metadata after moving the code chunk.
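A simplified Python sketch of fixup-based relocation follows. It is not the actual AmigaOS hunk format: the fixup table is assumed to be a plain list of offsets into the image, each pointing at a 32-bit big-endian absolute address that needs the load base added.

```python
def relocate_image(image: bytearray, fixups, old_base: int, new_base: int) -> bytearray:
    """Patch every listed 32-bit operand by the distance the image has moved."""
    delta = new_base - old_base
    for off in fixups:
        value = int.from_bytes(image[off:off + 4], "big")
        patched = (value + delta) & 0xFFFFFFFF
        image[off:off + 4] = patched.to_bytes(4, "big")
    return image
```

The loader would call this once with `old_base` equal to the assemble-time address. Keeping the fixup list around after loading is what would allow the same routine to be run again if the code chunk is moved later.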


VBR


Posted: Tue Jun 23, 2009 6:23 pm

Joined: Tue May 29, 2007 1:29 pm
Posts: 25

kc5tja wrote:

Thankfully, what you illustrated above is not toggling, but rather is clearly a "data stack" in the Forth processor parlance. You have, simply, a 2-deep stack.

There's no difference between an exchange of A and B, followed by LDA, and a transfer from A to B, followed by LDA.

kc5tja wrote:

However, this also yields quite a problem -- of what value is B when it's always being overwritten by the implicit-DUP with each LDA?

Say you want to add a multiply operation...

lda operand1
lda operand2
mul ; A*B

kc5tja wrote:

I can think of numerous, and very frequently executed, code sequences where you want B's value to remain constant as you load another value into A for processing.

Without the implicit XAB (exchange A and B), you would need more explicit XAB instructions.

Last edited by VBR on Tue Jun 23, 2009 7:14 pm, edited 1 time in total.


kc5tja


Posted: Tue Jun 23, 2009 7:09 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

VBR wrote:

There's no difference between an exchange of A and B, followed by LDA, and a transfer from A to B, followed by LDA.

I know. You misunderstood what I wrote.

Quote:

Say you want to add a multiply operation...

lda operand1
lda operand2
mul ; A*B

I'm intimately familiar with stack processors and the Forth language, having designed CPUs and implementations of the language in the past.

The problem is not that you're advocating a stack. The problem is that the rest of the CPU utilizes all explicit operations, registers, etc. (Contrary to popular belief, "implicit" register operands are still explicit, since the use of those registers is still an explicit part of that instruction's mathematical semantics.)

When I see LDA, I see this as Load Accumulator. Of which there are two. I cannot look at any single LDA and determine what its effects are. Saying "A goes to B" is only partially correct; the ambiguity happens when you don't know what A had originally, thus you no longer know what is in B, and yet that is vital to debugging the program. When dealing with a register-oriented architecture, of which the 6502 is a degenerate instance, this situation happens all too often.

The fix for this is to have a separate instruction. LDA touches A, and only A. To get the stack effects you are looking for, use a new instruction, such as LAS -- Load Accumulator Stack. The fact that this is a whole new instruction means that the semantics are unambiguous, and provides a ready indicator to the code reviewer that additional context is needed to determine the value of B.
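The distinction between the two instructions can be modelled in Python like this. LAS and the 8x8 `mul` are hypothetical instructions from this thread, not real 65xx opcodes, and `mul` returns the product rather than guessing which registers hardware would put it in.

```python
class Accumulators:
    def __init__(self, a: int = 0, b: int = 0):
        self.a, self.b = a, b

    def lda(self, value: int) -> None:
        """Plain load: touches A and only A."""
        self.a = value & 0xFF

    def las(self, value: int) -> None:
        """Hypothetical Load Accumulator Stack: old A drops into B, then A is loaded."""
        self.b = self.a
        self.a = value & 0xFF

    def mul(self) -> int:
        """Hypothetical 8x8 multiply of A and B, returning the 16-bit product."""
        return self.a * self.b
```

Reading a listing, an `lda` never needs more context than its own operand, while every `las` flags that B has changed too.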

In all honesty, though, implementing LAS won't offer a significant time advantage over XBA/LDA, since the 6502 has only a single bus inside it. Far more important to me are instructions that add (and/or subtract) immediate values to X, Y, and S.

Without this facility, even implementations of Forth spend unnecessary amounts of time tweaking the registers instead of executing the user's program.

Quote:

Without the implicit XAB (exchange A and B), you would need more explicit XAB instructions.

On a register machine, this is actually an advantage, not a disadvantage, as it greatly simplifies the internal design of the CPU (read, runs at lower power, has fewer transistors, can run at higher clock frequencies, etc). Also, you'll find that XAB would occur only around those instructions that actually use the B register.

I've been coding 65816 software for several years, and I've used the B register exactly once, in the middle of a graphics blit routine. Most other uses were far more easily expressed using direct page in 16-bit native mode.

But, like I said earlier, on a dedicated stack architecture, where the CPU is expressly built for such a mode of operation, the rules change.


VBR


Posted: Tue Jun 23, 2009 7:30 pm

Joined: Tue May 29, 2007 1:29 pm
Posts: 25

kc5tja wrote:

In all honesty, though, implementing LAS won't offer a significant time advantage to XBA/LDA, since the 6502 has only a single bus inside it.

You don't have to physically move data from one register to another; you can just rename the accumulator. That's why I used the word "toggle", because I thought of it as a renaming.
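The renaming idea can be sketched in Python with two physical registers and a one-bit selector that says which of them is currently called A. The class and attribute names are hypothetical; the point is only that no data moves between registers.

```python
class RenamedPair:
    def __init__(self):
        self.regs = [0, 0]   # two physical registers
        self.sel = 0         # which physical register is currently "A"

    @property
    def a(self) -> int:
        return self.regs[self.sel]

    @property
    def b(self) -> int:
        return self.regs[self.sel ^ 1]

    def lda(self, value: int) -> None:
        self.sel ^= 1                        # old A becomes B purely by renaming
        self.regs[self.sel] = value & 0xFF   # the load lands in the new A

    def xab(self) -> None:
        self.sel ^= 1                        # exchange is a rename, not a copy
```

In this model the programmer-visible A and B after a sequence of loads do not depend on the starting value of `sel`, so the phase of the toggle stays an internal detail.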

kc5tja wrote:

Far more important to me are instructions that add (and/or subtract) immediate values to X, Y, and S.

Without this facility, even implementations of Forth spend unnecessary amounts of time tweaking the registers instead of executing the user's program.

Indeed, having worked on the cc65 compiler I'm aware of how painful stack operations are.

kc5tja wrote:

But, like I said earlier, on a dedicated stack architecture, where the CPU is expressly built for such a mode of operation, the rules change.

Personally I don't see a huge difference between an accumulator CPU and a stack CPU. They are close cousins in my view.
