 Post subject: cost of a wider word
PostPosted: Sat Jul 25, 2009 9:57 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
On the implementation front, the first thing to check might be the relative speeds of an 8-bit and a 32-bit ALU. As we know, 6502 is almost entirely 8-bit and takes extra cycles to carry into the high byte. (The sole exception being the increment of PC, I think.)

Note also that the ROMs would have to be word-wide, unless there was a byte-wide bootstrap mode.


PostPosted: Sat Jul 25, 2009 9:59 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
It might be worth collecting and ranking the various ideas which enhance the 6502 but increase the complexity somewhat. Complexity costs: in time to implement, to verify, to document and to add into the toolchain. It may also cost clock speed and gate count. (As we have CPLDs and FPGAs, the gate count cost matters only if we're overflowing some particular part and having to move up to a bigger, more expensive one. For a custom chip, gate count costs area, which hits yield. Chip cost is an exponential function of chip area. The original 6502 had an impressively low gate count.)

Off the top of my head, for 65Org32, I quite like
    predication (every instruction can be conditionally executed)
    including branch offsets in the opcode
    other small constant signed operands included in the opcode
    including an ASL8 and maybe an ASL24 or ASL16
    8x8 multiply (or 32x32 if on an FPGA with hardware multipliers)
and I also like, but would defer:
    removing dead cycles
    stack based address mode
    RL register so all accesses are relative: free relocation
    B accumulator (or just a B register)
    Z register
    relative JSR
    putting a 16 or 24-bit address into an opcode for a new sort of zero-page
    barrel shifter
    prefetch buffer
    top of stack held on chip
    on chip cache
    indirect JSR
    block move
    B and maybe C reg as a push-down stack: just need a POP or ROT, implicit push on every LDA. Maybe a DUP.
    some byte permutations


For some choices, it matters whether or not the core is running faster than external memory, and whether or not the data bus is narrower than the operands (or opcode+operand). For a memory system which isn't simple, there are questions about flushing cache, reading back data just written, uncacheable accesses, supporting self-modifying code, and what alignment restrictions might apply. I'd defer all that.


 Post subject:
PostPosted: Sat Jul 25, 2009 11:19 pm 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Quote:
I've just realised that 65Org32 differs from 6502 crucially in that addresses are just one 32-bit byte

which is why there are no zero-page limitations. Now you can use any address for indirects, without having to be on a tight budget for ZP (or DP) use, because all of memory-- all 4,294,967,296 addresses-- is in ZP. Page 1, where the hardware stack resides, could also be seen as having 4,294,967,296 addresses (although they're the same ones), so your stack depth is unlimited. Ruud wanted to be able to put long strings and other large things on the hardware stack for the Pascal compiler he was writing. Now he could put a whole movie on the stack if he wanted to!

Quote:
This is all good from a cycle count perspective, but changes the implementation quite a bit

in a nice way, huh?! It's simpler.

Quote:
On the implementation front, the first thing to check might be the relative speeds of an 8-bit and a 32-bit ALU. As we know, 6502 is almost entirely 8-bit and takes extra cycles to carry into the high byte.

Yes, and it has especially been a pain on the 6502 when incrementing a multi-byte number by say 2 when you don't know if the beginning number will be even or odd. I can think of several ways to do it, but none get much better than just doing a quad-precision "add 2":
Code:
        LDA   ADDR
        CLC
        ADC   #2
        STA   ADDR
        LDA   ADDR+1
        ADC   #0
        STA   ADDR+1
        LDA   ADDR+2
        ADC   #0
        STA   ADDR+2
        LDA   ADDR+3
        ADC   #0
        STA   ADDR+3

which is nowhere near as good as the 32-bitter's
Code:
        INC   ADDR
        INC   ADDR

or, per your earlier suggestion, simply
Code:
        INC2  ADDR

However, this might be a situation where we don't want to compare apples to apples. The 6502 must do the multiple INCs or DECs in order to move to the next pair of bytes holding an address, which is the typical use for this procedure. For the 32-bitter OTOH, a single INC or DEC is sufficient to move to the next memory unit containing a complete address.
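To make the carry chain concrete, here is a small Python model (not 6502 code) of the byte-at-a-time add next to a single 32-bit-wide add. The function names and the lists standing in for ADDR..ADDR+3 are assumptions made for illustration only.

```python
def add_bytewise(mem, addr, value):
    """Add a small constant to a 4-byte little-endian number, one byte
    at a time, mirroring the LDA/ADC/STA chain on an 8-bit machine."""
    carry = 0                    # CLC
    operand = value              # ADC #value on the low byte, ADC #0 afterwards
    for i in range(4):
        total = mem[addr + i] + operand + carry
        mem[addr + i] = total & 0xFF   # STA ADDR+i
        carry = total >> 8             # carry out feeds the next ADC
        operand = 0
    return mem


def add_wordwise(mem, addr, value):
    """Same add where one memory unit is 32 bits wide: a single operation."""
    mem[addr] = (mem[addr] + value) & 0xFFFFFFFF
    return mem


narrow = [0xFF, 0xFF, 0x00, 0x00]      # $0000FFFF stored low byte first
add_bytewise(narrow, 0, 2)
narrow_value = int.from_bytes(bytes(narrow), "little")

wide = [0x0000FFFF]                    # the same number in one 32-bit unit
add_wordwise(wide, 0, 2)

print(hex(narrow_value), hex(wide[0]))
```

Both paths arrive at $00010001; the difference is that the narrow one needs four read-modify-write steps to ripple the carry.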

Quote:
Note also that the ROMs would have to be word-wide, unless there was a byte-wide bootstrap mode.

Yes, I was thinking of four 8-bit-wide ones, or two 16-bit-wide ones. Another option, which I've been wanting to try even with the 6502 or '816, is to have something pre-load RAM with the boot-up code before releasing the processor from RST. Then you don't need any slow ROM on the bus at all.

Quote:
It might be worth collecting and ranking the various ideas which enhance the 6502 but increase the complexity somewhat. Complexity costs: in time to implement, to verify, to document and to add into the toolchain.

For some things, we'll have to discuss it with whoever does the HDL design, to evaluate tradeoffs. For the "toolchain" I was thinking of using Universal Cross Assemblers' C32 assembler which comes with the files to assemble for dozens of different processors (no need to buy another assembler for each) and gives you the tools to write files for new processors. Once I have my Forth kernel going with that, I'll write an assembler in Forth that will run on the target like I have on my current system except that it will be more powerful.

Quote:
Off the top of my head, for 65Org32, I quite like
  • predication (every instruction can be conditionally executed)

Can you give examples and show what makes them useful?

Quote:
  • including branch offsets in the opcode
  • other small constant signed operands included in the opcode
  • including an ASL8 and maybe an ASL24 or ASL16

This would make for more-compact code, and would especially be appropriate for shift and rotate distances (which covers one of the next things you mention), although I expect it might make for more-complex and maybe slower instruction-decoding. The usual solution is to have a deep pipeline working on the decoding before the instruction's turn comes up to be executed, but then we're talking about more complexity again. OTOH, the input clock could be four or eight times the bus clock in order to get more ticks to accomplish things for each memory cycle, like having a 40 or 80MHz input clock for a 10MHz phase-2 output. This should eliminate the dead bus cycles which you also mentioned later.

Quote:
  • 8x8 multiply (or 32x32 if on an FPGA with hardware multipliers)

For my uses, I really want the 32x32, which is where the B accumulator you also mentioned comes in. It wouldn't bother me too much if the processor had to be in two or three ICs instead of all fitting into one.

Quote:
and I also like, but would defer:
  • stack-based address mode
  • RL register so all accesses are relative: free relocation

The 65816 already has these, except that its "relative" part is the program bank and data bank registers which I would want to extend to 32-bit so there's no constraint to 64K blocks. The offsets can be any 32-bit number. As with the '816, you're not forced to use them though. Absolute addressing uses them but long does not. (Both abs and long will be 32-bit here though.) The 816's DP register gives another offset for the movable zero page within the first 64K bank; but now that too will be free to be anywhere in the 32-bit address space.

The stack-based address mode, if I'm understanding you right, is called "stack-relative," as in the "CMP 2,S" in one of my listings on the last page, which means "compare accum to what is 2nd from the top of the stack," regardless of where the top of the stack is at the moment.
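As a rough illustration of stack-relative addressing, the Python sketch below models a descending stack where S points at the next free slot (the 65816 convention, assumed here to carry over), so that offset 1 reaches the top item and offset 2 the one below it. The `Machine` class and its method names are hypothetical.

```python
class Machine:
    """Toy model: memory as a dict, descending stack, S = next free slot."""

    def __init__(self, stack_top=0x01FF):
        self.mem = {}
        self.s = stack_top

    def push(self, value):
        self.mem[self.s] = value   # store at the free slot...
        self.s -= 1                # ...then move S down

    def stack_relative(self, offset):
        """Read the operand at offset,S without popping anything."""
        return self.mem[self.s + offset]


m = Machine()
for v in (10, 20, 30):
    m.push(v)

top = m.stack_relative(1)      # most recently pushed item
second = m.stack_relative(2)   # what "CMP 2,S" would compare against
print(top, second)
```

Because the offset is applied to the current S, the same "2,S" operand keeps meaning "second from the top" however deep the stack happens to be.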

Quote:
  • Z register

The 65CE02 had this, and it was initialized at 0 at RST. If we want another register for miscellaneous use, it would probably be good to differentiate it from a constant zero so there's no confusion if you still want to store zero (STZ, which the 65c02 and '816 have) versus store the register's content.

Quote:
  • relative JSR

Yes, unless we find that the relative-address registers mentioned two paragraphs up make it unnecessary. The '816 can jerry-rig this far more easily than the 6502 can, but it still leaves room for improvement.

Quote:
  • putting a 16 or 24-bit address into an opcode for a new sort of zero-page
  • barrel shifter
  • prefetch buffer

discussed above

Quote:
  • top of stack held on chip

I see advantages and disadvantages. Can you elaborate?

Quote:
  • on-chip cache

That's getting more complex again, and it makes it hard to pre-determine performance.

Quote:
  • indirect JSR

The '816 has JSR (addr,X), and adding JSR (addr) might make sense. As it is now, it can be synthesized for the few times it's needed, with PER followed by a JMP indirect. Pretty simple.

Quote:
  • block move

Absolutely, as the '816 already has it. The '816 takes 7 clocks per byte though, and I think we can do better, at least if the input clock is a multiple of the phase-2 output.

Quote:
  • B and maybe C reg as a push-down stack: just need a POP or ROT, implicit push on every LDA. Maybe a DUP.

Do you have a particular use for a HLL in mind? If the processor still has to go out on the bus to update the stack in memory to reflect what's in its registers, or refill its registers from memory, won't there be a performance hit that kind of negates the benefit?

Quote:
For a memory system which isn't simple, there are questions about [...] supporting self-modifying code [...] I'd defer all that.

I do use self-modifying code in my ITC Forth's inner loop to do a double indirect more efficiently, making the operand also serve as a variable. Keeping the operand separate from the op code helps immensely there. I suspect other HLLs do the same kind of thing. This is one of the reasons to stay away from the separate instruction and data memories of a Harvard architecture.

I got permission from Phil Pemberton, the owner of the 6502 Yahoo forum, to put an invitation there to have others join in this discussion since there has been virtually no activity on that forum recently (8 posts in the last 3 years). Hopefully we'll attract a couple of HDL designers who can help us turn the talk into hardware. I'll be posting the invitation soon.

Although it's not a 65-family processor, take a look at the home-made TTL CPU project at http://web.whosting.ch/dieter/trex/trex.htm . The link was sent by someone who would rather lurk and not post. The CPU occupies multiple 6U VME form factor boards, with mezzanines. Very impressive and nicely done, although never quite finished, and not what we're looking to do here. Edit, 7/24/12: That's Dieter, and Mike just posted his updates at http://6502.org/users/dieter/index.htm . There is a section there on the 6502 ALU.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


 Post subject:
PostPosted: Sun Jul 26, 2009 6:30 am 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
One thing that has not been mentioned-- is the added design complexity of the decimal mode warranted?

I haven't had any use for it in my Forth because internally it only keeps numbers in hex, and if it needs to convert the base for I/O, the conversion method works the same regardless of base, without using the decimal mode.

But some might still want the decimal mode. Maybe Lee, for EhBASIC? (I don't know how it stores and deals with the numbers internally.) How 'bout you Ruud, for Pascal? Samuel, with any of the dozens of languages you have under your belt? (I hope the heroes of the forum haven't given up reading this voluminous topic.) I don't imagine the decimal mode is very useful for Tony's game machines. Bruce, in the algorithms dept?

Someone said decimal was needed in the financial field because there was no way to precisely represent one cent as a hundredth of a dollar in binary (%0.00000010100011110101110 is about a millionth of a cent off, if I figured it right); but after working plenty in scaled-integer and fixed-point arithmetic, I don't see any problem with it if internally you just make the cents the units and a dollar a hundred units (or scale it finer if you want to). Then a cent is exact.
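The arithmetic here can be checked with exact fractions. The Python sketch below truncates 0.01 to 23 binary places and measures the error, then shows the scaled-integer alternative with cents as the unit. The variable names and the sample price are made up for illustration.

```python
from fractions import Fraction

FRACTION_BITS = 23
cent = Fraction(1, 100)                      # one cent, in dollars

# Truncate 0.01 to 23 bits after the binary point.
scaled = cent.numerator * 2**FRACTION_BITS // cent.denominator
truncated = Fraction(scaled, 2**FRACTION_BITS)

# Error of that binary fraction, expressed in cents.
error_in_cents = (cent - truncated) * 100
print(float(error_in_cents))                 # roughly a millionth of a cent

# Scaled-integer alternative: make the cent the unit, so every amount is exact.
price_cents = 1999                           # hypothetical price of $19.99
total_cents = 3 * price_cents
dollars, cents = divmod(total_cents, 100)
print(f"${dollars}.{cents:02d}")
```

With cents as the unit there is no fractional part to approximate, so the binary-versus-decimal question never comes up until the value is formatted for display.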

I did use the decimal mode in a product I programmed 20 years ago; but that was in assembly language without the benefit of a HLL. I wouldn't do that kind of job again in assembly though.



 Post subject: decimal mode support
PostPosted: Sun Jul 26, 2009 8:59 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
GARTHWILSON wrote:
One thing that has not been mentioned-- is the added design complexity of the decimal mode warranted?


In general, I'd prefer not to have modes. So I'd drop the feature, or add a Decimal Adjust. If not dropping it, it's a verification problem, and I've seen mention of bugs.

Specifically, if 65Org32 is built by modification of an existing 6502 design, it might be easier to retain decimal mode. The 6502 takes an extra cycle to do the decimal adjust implicitly, whereas a DAA might take two cycles to do it explicitly.

I suppose decimal mode was put in to support point-of-sale applications where code size would be very important, and I imagine it cost relatively few transistors. In 65Org32 I think it's an unwanted extra.


PostPosted: Sun Jul 26, 2009 9:17 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
GARTHWILSON wrote:
Quote:
Off the top of my head, for 65Org32, I quite like
  • predication (every instruction can be conditionally executed)

Can you give examples and show what makes them useful?


See this Wikipedia article

If a short forward branch costs 2 or 3 cycles (and one or two words) to skip over a single instruction, then using a predicated instruction will cost 0 or 2 cycles (and save one or two words):
Code:
   BCC skip
   INC highword
skip:   # arrive here generally in 3 cycles, otherwise in 4

is replaced by
Code:
   INC.CS highword
skip:   # arrive here always in 2 cycles (one is dead)

(This is just the first example which came to mind, even though multi-word operations aren't likely to be a convincing case. Sign-extending a byte would be a better example, or the conditional add in a shift-add multiply, or possibly the conditional add in a non-restoring division.)
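To show where a conditional add sits inside a shift-add multiply, here is a minimal Python sketch. It models the algorithm only, not any proposed 65Org32 instruction; the function name and the 32-bit default width are assumptions.

```python
def shift_add_multiply(a, b, width=32):
    """Multiply by examining one multiplier bit per iteration.
    Returns the low `width` bits of the product."""
    mask = (1 << width) - 1
    product = 0
    for _ in range(width):
        if b & 1:
            # This is the conditional add: on a real CPU it is either a
            # short branch around an add, or a single predicated add.
            product = (product + a) & mask
        a = (a << 1) & mask    # shift the multiplicand left
        b >>= 1                # move to the next multiplier bit
    return product


print(shift_add_multiply(1234, 5678))
```

The `if` body is a single instruction skipped or executed once per bit, which is exactly the pattern predication targets.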

On consideration, this is mostly a win in a pipelined machine, where a branch costs more. We have the encoding space, but it might not be worth it.

So I'd now drop this into the defer pile.


Last edited by BigEd on Sun Jul 26, 2009 9:46 am, edited 2 times in total.

PostPosted: Sun Jul 26, 2009 9:23 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
GARTHWILSON wrote:
Quote:
  • B and maybe C reg as a push-down stack: just need a POP or ROT, implicit push on every LDA. Maybe a DUP.

Do you have a particular use for a HLL in mind? If the processor still has to go out on the bus to update the stack in memory to reflect what's in its registers, or refill its registers from memory, won't there be a performance hit that kind of negates the benefit?


I was thinking of two things:
    1. an evaluation stack, like an HP calculator. But that was a mistake: we don't have and I don't suggest an addressing mode to take operands off this stack.
    2. automatic temporaries, like the B register but more automatic. Some code is using A, needs to do something which would need another register, and instead of PHA/PLA or similar it can just use A and then POP.

If it turns out that 65816's B register is never used, this is low value. I wasn't thinking of this small onchip stack as being at all related to the SP stack.

It's already on my defer pile.


 Post subject:
PostPosted: Sun Jul 26, 2009 9:36 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
GARTHWILSON wrote:
Quote:
  • 8x8 multiply (or 32x32 if on an FPGA with hardware multipliers)

For my uses, I really want the 32x32, which is where the B accumulator comes in which you also mentioned.

I was thinking one can get a faster multiplication of any size if built on top of an 8x8 primitive than if doing everything by shift and add. 8x8 is small enough to hide, whereas 32x32 done in gates would be huge. A 32x32 implemented in 18x18 primitives would be fine though: so if the implementation technology has those, use them. Otherwise, stick to 8x8. (I'm implicitly rejecting any multicycle implementation: all ALU ops must be single cycle, for simplicity.)
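As a sketch of building a wide multiply on top of an 8x8 primitive, the Python below forms a 32x32 product from sixteen byte-sized partial products. `mul8x8` stands in for the hypothetical single-cycle hardware primitive; both function names are illustrative.

```python
def mul8x8(a, b):
    """Stand-in for a hardware 8x8 -> 16-bit multiply primitive."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def mul32x32(a, b):
    """32x32 -> 64-bit product assembled from 8x8 partial products."""
    result = 0
    for i in range(4):                       # bytes of a, low to high
        for j in range(4):                   # bytes of b, low to high
            partial = mul8x8((a >> (8 * i)) & 0xFF, (b >> (8 * j)) & 0xFF)
            result += partial << (8 * (i + j))   # weight by byte position
    return result


print(hex(mul32x32(0x12345678, 0x9ABCDEF0)))
```

This takes sixteen primitive multiplies plus shifted adds, compared with up to 32 shift-and-conditional-add iterations for the bit-serial approach.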

GARTHWILSON wrote:
Quote:
and I also like, but would defer:
  • stack-based address mode
  • RL register so all accesses are relative: free relocation

The 65816 already has these, except that its "relative" part is the program bank and data bank registers

I don't see the 65816 as helping the implementation of 65Org32 - it's interesting and it's useful, but lots of other processors are also good for ideas. We do have some 6502 implementations to use as starting points: if we're very far removed from 6502 then we have to start afresh, so count another couple of person-years for the project.

I throw in RL as a simplification and consolidation of program bank and data bank: it's more of a per-process base register, applied to all accesses, and of course is initially set to zero.
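One way to picture RL as a per-process base register is the Python model below, in which every access has RL added to the program's address before it reaches memory. The `RelocatingBus` class, its method names, and the 32-bit wraparound are assumptions, not part of any agreed design.

```python
MASK32 = 0xFFFFFFFF


class RelocatingBus:
    """Toy memory system where RL offsets every access."""

    def __init__(self):
        self.rl = 0        # initially zero, so addresses pass through unchanged
        self.mem = {}

    def effective(self, addr):
        return (self.rl + addr) & MASK32

    def read(self, addr):
        return self.mem.get(self.effective(addr), 0)

    def write(self, addr, value):
        self.mem[self.effective(addr)] = value & MASK32


bus = RelocatingBus()
bus.write(0x10, 111)       # first process, loaded at base 0

bus.rl = 0x00010000        # switch to a second process loaded elsewhere
bus.write(0x10, 222)       # same program address, different physical location

print(bus.read(0x10), hex(bus.effective(0x10)))
```

Neither process has to be reassembled for its load address, which is the "free relocation" in the original list.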

I think I'd better stop there: if I reply to everything on my defer list it'll become a very deep and wide thread!

Cheers


 Post subject: 65Org16
PostPosted: Sun Jul 26, 2009 9:42 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
GARTHWILSON wrote:
Quote:
I've just realised that 65Org32 differs from 6502 crucially in that addresses are just one 32-bit byte

which is why there are no zero-page limitations.
Quote:
This is all good from a cycle count perspective, but changes the implementation quite a bit

in a nice way, huh?! It's simpler.


Here's an idea: the 65Org16 is a 6502 where every byte is 16bits. All addresses are now 32-bit, zero-page is 64k locations, as is stack, and everything else is modified in the way you'd expect. All the cycle counts (and therefore the instruction decode and datapath control) are unchanged.

The big win: implementation is derived very straightforwardly from an existing 6502.

Half the accessible memory of 65Org32, and less 32-bit performance. Fewer pins.

Of course you can write 32-bit software on this machine just as you could write it on any 8-bit micro, but you do two 16bit-byte operations instead of four 8bit-byte operations.

Cheers
Ed


Last edited by BigEd on Sun Jul 26, 2009 12:12 pm, edited 1 time in total.

 Post subject:
PostPosted: Sun Jul 26, 2009 11:49 am 
Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
How about providing multiply and divide step instructions instead of a whole 32x32 operation?

While the step approach slows down multiply it increases the speed of divide and reduces the size of the ALU logic.
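For readers unfamiliar with step instructions, here is one possible reading of a "divide step" sketched in Python: a single iteration of restoring division, repeated once per bit. This is an assumption about what such an instruction might do, not a specification from the thread; the function names are illustrative and the divisor is assumed to be nonzero.

```python
def divide_step(remainder, quotient, divisor, width=32):
    """One bit of restoring division: shift the remainder:quotient pair
    left, then subtract the divisor if it fits."""
    mask = (1 << width) - 1
    remainder = (remainder << 1) | (quotient >> (width - 1))  # take top bit
    quotient = (quotient << 1) & mask
    if remainder >= divisor:
        remainder -= divisor
        quotient |= 1          # record a 1 in the new low quotient bit
    return remainder, quotient


def divide(dividend, divisor, width=32):
    """Unsigned divide built from `width` step operations."""
    remainder, quotient = 0, dividend
    for _ in range(width):
        remainder, quotient = divide_step(remainder, quotient, divisor, width)
    return quotient, remainder


print(divide(1000, 7))
```

Each step updates two values (remainder and quotient), which is worth keeping in mind when judging how it would fit a given instruction sequencer.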

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


PostPosted: Sun Jul 26, 2009 4:25 pm 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
BitWise wrote:
How about providing multiply and divide step instructions instead of a whole 32x32 operation?


It's an interesting idea, but I'd put it in my 'defer' list, because I don't think it fits into the 6502 sequencer: it would use the ALU twice and write to two registers.

Once the sequencer admits more general instructions, it might allow for a self-repeating multicycle multiply, like the block moves. With similar questions about interrupts.


 Post subject: Re: 65Org16
PostPosted: Mon Jul 27, 2009 7:41 am 
Joined: Mon Oct 16, 2006 8:28 am
Posts: 106
BigEd wrote:
Here's an idea: the 65Org16 is a 6502 where every byte is 16bits. All addresses are now 32-bit, zero-page is 64k locations, as is stack, and everything else is modified in the way you'd expect. All the cycle counts (and therefore the instruction decode and datapath control) are unchanged.

The big win: implementation is derived very straightforwardly from an existing 6502.

Half the accessible memory of 65Org32, and less 32-bit performance. Fewer pins.

Of course you can write 32-bit software on this machine just as you could write it on any 8-bit micro, but you do two 16bit-byte operations instead of four 8bit-byte operations.

Cheers
Ed

That would definitely simplify things, but I would argue that 8-bit LDA and STA should still be allowed, in addition to 16-bit versions (the index registers are, by their very nature, probably best always kept 16 or 32 bits wide). I say this after having struggled with the 65816's m/x modal architecture; it's just too annoying after having worked with the m68K, x86 and 6809, all of which allow selecting 8- or 16-bit access for free for each opcode. Many algorithms require working with 8-bit bytes mixed with wider words (e.g. matrix algebra where the vectors are composed of bytes and the intermediate results are kept in 16-bit words), and having to do a bunch of shifts and masks is both a drain on performance and a source of bugs.


 Post subject: Re: 65Org16
PostPosted: Mon Jul 27, 2009 5:22 pm 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
faybs wrote:
I say this after having struggled with the 65816's m/x modal architecture


Likewise, so no modes here. Strings are fine, once you've understood them as a stream of bytes. The bytes just happen to be wider than 8 bits. There are no alignment or shifting requirements, because memory is not addressable on 8-bit boundaries and is not being conserved, so 8-bit values are not packed but rather zero-extended or sign-extended.

faybs wrote:
Many algorithms require working with 8 bit bytes mixed with wider words (eg matrix algebra)


There are no 8-bit bytes. If your values are small, you can hold the product in one 16-bit byte. If they are not small, you'll need some two-byte words of 32 bits, just as you used to cope on a machine with 8-bit bytes.

If you access an 8-bit peripheral, your driver (or hardware) will need to mask down to 8 bits, and maybe sign-extend. That's one or three extra instructions on each access.
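The masking and optional sign extension a driver would do can be sketched in Python as below. The helper names are hypothetical, and the 16-bit default width matches the 65Org16 "byte"; the upper bits of the raw read are treated as undefined bus noise.

```python
def zero_extend8(raw):
    """Keep only the low 8 bits of a wide read from an 8-bit device."""
    return raw & 0x00FF


def sign_extend8(raw, width=16):
    """Mask to 8 bits, then copy bit 7 into all higher bits of the unit."""
    value = raw & 0xFF
    if value & 0x80:
        value |= ((1 << width) - 1) & ~0xFF   # set bits 8..width-1
    return value


raw_read = 0xAB85          # upper half is whatever the bus happened to float to
print(hex(zero_extend8(raw_read)), hex(sign_extend8(raw_read)))
```

The mask alone is the one-instruction case; the test-and-fill for the sign is what pushes it towards three.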

In the interest of defining something which can handle 32-bit addresses but has a chance of being implemented, I'd hold 65Org16 to be a very compatible wide-byte copy of 6502. It should double the per-clock performance of 6502 when working on 16-bit or 32-bit data. It should also clock somewhat faster.

If it's ever built and anyone is interested, it could be enhanced to have a smarter sequencer, more opcodes, packed operands, etc etc. It could also be used as the base for a 65Org32 development, which would double the performance again when dealing with 32-bit data.

I think a series of implementable revisions is more likely to be a successful project.


 Post subject:
PostPosted: Mon Jul 27, 2009 7:24 pm 
Joined: Fri Aug 30, 2002 2:05 pm
Posts: 347
Location: UK
Quote:
But some might still want the decimal mode. Maybe Lee, for EhBASIC? (I don't know how it stores and deals with the numbers internally.)

EhBASIC does use the decimal mode but only for converting nibbles to ASCII "A" to "F" in the HEX$() function and this could be recoded not to need it.

Lee.


 Post subject:
PostPosted: Mon Jul 27, 2009 8:11 pm 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Thanks for the comment, Lee.  Good to know.

Quote:
In general, I'd prefer not to have modes. So I'd drop the feature, or add a Decimal Adjust.  If not dropping it, it's a verification problem, and I've seen mention of bugs.

The true bugs are all fixed in the CMOS ones.

Regarding predication:  The 65c02 has BBR & BBS instructions which seem to fall into that category, if I'm understanding it right.  The PIC microcontroller has things like BTFSS (Bit Test File, Skip the next instruction if the bit is Set), and I have wasted countless hours debugging because of the confusion resulting from the backwards logic that says "if this is true then don't do that."  I probably should have used macros to re-name the mnemonics long ago.  The 65c02's BBR & BBS are far clearer.  Truthfully though, I don't remember ever using them, probably because they're only for ZP.  I've given 6502 names to macros I do use regularly in my PIC programming, like BEQ, which assembles BTFSC  STATUS,Z, followed by GOTO, so I end up using two instructions and 8-12 clocks to do what the 6502 does in 2-3 clocks.

Skipping just one instruction is a pretty rare need, so branching longer distances is the norm.  Reading your Wikipedia page on it, it looks like the benefits of predication materialize on machines with branch prediction, caches, and deep pipelines (not the shallow one of the 6502).  It would be another thing we'd have to run by the HDL designer(s), but at this point I'm not optimistic about it.

Quote:
1. an evaluation stack, like an HP calculator.  But that was a mistake: we don't have and I don't suggest an addressing mode to take operands off this stack.

I use my HP-41cx every day, plus Forth which takes the data stack usage further, and am a strong supporter of such a stack.  The 65816 already has stack-relative addressing for the hardware stack (the stack that's in page 1 on the 6502), and the 6502 almost looks like it was intentionally made to use ZP for the Forth data stack using X as the pointer.  It works very efficiently considering the stack memory is outboard.

Quote:
If it turns out that 65816's B register is never used, this is low value.  I wasn't thinking of this small onchip stack as being at all related to the SP stack.

You've mentioned this before.  When I see "SP", I think of "stack pointer," but you seem to mean something different.  Is that true?  The '816's B is just the high byte of the accumulator; so in 16-bit work, it's used all the time, though not addressed as B.  Rather, A and B are the low and high bytes of C.  LDA in 16-bit mode is really LDC, as it loads both A and B.  Even in 16-bit mode, there's the XBA instruction that swaps the high and low bytes.  [Edit, years later:  I really shouldn't call it 16-bit mode.  There are really only two mode bits: the decimal-mode bit and the emulation-mode bit.  In native mode, ie, '02-emulation mode is off, M and X specify register widths, not modes.]

Quote:
How about providing multiply and divide step instructions instead of a whole 32x32 operation?

While the step approach slows down multiply it increases the speed of divide and reduces the size of the ALU logic.

I wouldn't mind if the multiply and divide instructions took as long as the longest of the other instructions (maybe 5 bus cycles or so, which might mean 20 cycles of the input clock).  If necessary, it would even be ok if it took a few such instructions to complete the job.  That's still well over an order of magnitude faster than the iterative routines they would be replacing.  One thing I definitely don't want however is that it go more than 5 or so bus cycles without being interruptible.  The 68000 was terrible in that regard IIRC.

Quote:
I don't see the 65816 as helping the implementation of 65Org32 - it's interesting and it's useful, but lots of other processors are also good for ideas.  We do have some 6502 implementations to use as starting points:  if we're very far removed from 6502 then we have to start afresh

The 65816 has a lot of things the 6502 should have had initially if silicon real estate and memory weren't so expensive at the time.  Although the '816 is intimidating to those unacquainted with it, I find it actually makes programming a lot easier.  It shouldn't be ignored.

Quote:
I throw in RL as a simplification and consolidation of program bank and data bank:  it's more of a per-process base register, applied to all accesses, and of course is initially set to zero.

The '816 could easily have a program in one bank that handles data in many banks; so in that case it made sense to have separate program- and data-bank registers.  Now that we're getting rid of the bank boundaries and extending the offset register to 32 bits, I don't see any problem so far with consolidating the two.

Quote:
Here's an idea: the 65Org16 is a 6502 where every byte is 16bits.  All addresses are now 32-bit, zero-page is 64k locations, as is stack, and everything else is modified in the way you'd expect.  All the cycle counts (and therefore the instruction decode and datapath control) are unchanged.

I initially considered that.  Perhaps both processors should be developed.  The one with the 16-bit data bus (65Org16? 651632?) can get away with 68 pins.  The one with the 32-bit data bus (65Org32) will need 84, so not a big difference there.  I don't think I'll be wire-wrapping it, but rather laying out a board to get made.  The fact that loading words in two sections on the 16-bitter requires the shift and stepped loading-and-storing logic might mean it will need nearly the same amount of internal logic that the 32-bitter will need.

Quote:
That would definitely simplify things, but I would argue that 8 bit LDA and STA should still be allowed, in addition to 16-bit versions (the index registers are, by their very nature, probably best always kept 16 or 32 bits wide).  I say this after having struggled with the 65816's m/x modal architecture, it's just too annoying after having worked with the m68K, x86 and 6809, all of which allow selecting 8 or 16-bit access for free for each opcode.  Many algorithms require working with 8-bit bytes mixed with wider words (eg matrix algebra where the vectors are composed of bytes and the intermediate results are kept in 16 bit words), and having to do a bunch of shifts and masks is both a drain on performance and a source of bugs.

The 65816 and other processors needed the option to load or store 8 or 16 bits because the registers' width doesn't always match the bus width.  With the 65Org32 with a 32-bit data bus and registers, I think the need mostly goes away.  The "byte" is simply 32 bits.  I say "mostly" instead of "completely" because for use with 8-bit I/O devices, we might still want an 8-bit LDA that not only ANDs what is read with $000000FF but also transfers bit 7 (instead of bit 31) to the N flag, and an 8-bit BIT instruction that additionally transfers bit 6 (instead of bit 30) to the V flag.

I never had any trouble keeping the 816's M & X flags straight though.  Maybe it was just my application.  The only development I've done on the '816 was my feature-loaded Forth.  Almost always, the index registers stayed in 8-bit and the accumulator in 16-bit mode.  Macros ACCUM8, ACCUM16, INDEX8, and INDEX16 made it clear what was happening, instead of relying on the cryptic REP and SEP instructions.  Index registers could have stayed in 16-bit mode too, except that there would be a small performance penalty from the 8-bit data bus when they had to be loaded or stored.

I posted the invitation on the Yahoo forum this morning, and heard from Tony Rudzki who had an excellent idea about the 8-bit issue.  He suggested having a particular part of memory, perhaps a 64K segment at a location that makes address-decoding easy, be for 8-bit loads, and put your I/O there.  Then we don't need extra instructions or mode-setting to get the 8-bit loads.  Update: Ed's idea of doing it in hardware might be even better-- using minor external hardware so a read of an 8-bit device repeats bits 6 and 7 into bits 30 and 31, and bits 8-29 are grounded for that read cycle.  Then the processor needs no extra complexity at all.
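The external-hardware idea can be modelled in a few lines of Python: the 8-bit device value appears in bits 0-7, bits 6 and 7 are repeated into bits 30 and 31, and bits 8-29 read as zero. The function names are hypothetical, and the flag helper assumes N and V are taken from bits 31 and 30 as on a straightforwardly widened 6502.

```python
def widen_io_read(device_byte):
    """What the 32-bit data bus would see when reading an 8-bit device."""
    b = device_byte & 0xFF
    # Shift bits 7 and 6 up by 24 places so they land on bits 31 and 30.
    return b | ((b & 0xC0) << 24)


def bit_flags(word):
    """N and V as a BIT instruction would take them from a 32-bit operand."""
    n = (word >> 31) & 1
    v = (word >> 30) & 1
    return n, v


status = widen_io_read(0xC0)       # device reports bits 7 and 6 set
print(hex(status), bit_flags(status))
```

With the bits replicated in hardware, an unmodified LDA or BIT sees the device's bit 7 and bit 6 in the flags, so the processor needs no 8-bit variants of those instructions.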


