6502.org Forum  Projects  Code  Documents  Tools  Forum

All times are UTC




Posted: Wed Nov 30, 2011 8:03 pm
Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Warning: O.T...
Arlet wrote:
ElEctric_EyE wrote:
So, am I to understand the code would look like this:

Code:
/*
 * Address Generator
 */

parameter
   ZEROPAGE  = 0,
   STACKPAGE = 0;


You could do that, but they are really "don't cares". With a 32 bit address and data bus, the "upper 32 bits" of the address wouldn't exist, and all logic driving it would be optimized away by the synthesis tools.

So if one wanted a 4K 'shared' ZeroPage/Stack on the 65Org16 for instance, one where zero page and stack crept towards each other, but still saving valuable BlockRam, what would the Address Generator code look like? Same as above? And if so, safe to comment the above out? But then how to define the length?


Posted: Wed Nov 30, 2011 8:45 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
ElEctric_EyE wrote:
So if one wanted a 4K 'shared' ZeroPage/Stack on the 65Org16 for instance, one where zero page and stack crept towards each other, but still saving valuable BlockRam, what would the Address Generator code look like? Same as above? And if so, safe to comment the above out? But then how to define the length?


Basically, with a 32 bit address bus, and 32 bit registers, you'd get this:

Code:
JSR0: AB = regfile;   // instead of AB = { STACKPAGE, regfile}
INDY1: AB = ADD; // instead of AB = { ZEROPAGE, ADD}

...and so on....


AB would become a 32 bit register, and the regfile/ADD/ABL registers as well.
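
As a rough illustration of the difference, here is a small Python model (not the core's Verilog; the function names `ab_8bit` and `ab_32bit` are made up for this sketch). On the 8-bit core a page parameter is concatenated above an 8-bit value, while on a 32-bit core the register value is already a full address.

```python
# Hypothetical software model of the address generator; names are illustrative only.

def ab_8bit(page: int, low: int) -> int:
    """8-bit core: AB = {page, low}, an 8-bit page glued on top of an 8-bit value."""
    return ((page & 0xFF) << 8) | (low & 0xFF)

def ab_32bit(value: int) -> int:
    """32-bit core: AB = regfile (or ADD); the value already spans the whole bus."""
    return value & 0xFFFFFFFF

print(hex(ab_8bit(0x01, 0xFD)))   # classic 6502 stack access in page 1
print(hex(ab_32bit(0x00000FFF)))  # e.g. a stack pointer at the top of a 4K region
```

In this model nothing in the address generator knows a region's length; presumably the size of a shared zero page/stack area would be set by where the stack pointer is initialised and where zero-page variables are placed.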


Posted: Thu Dec 01, 2011 4:24 am
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
kc5tja wrote:
BigEd wrote:
kc5tja reckoned that indirect indexed is not 'needed' in the sense that modern RISC does without

This is not to suggest it should be removed; I'm just curious what gain we get from having it.

Good -- this is certainly worth discussing. But I'm puzzled. Is it [ZeroPage],Y mode we might omit, [ZeroPage] mode, or both? IIRC, Arlet's core offers only [ZeroPage],Y. If we omit that then ZeroPage loses its functionality as an array of Pointer Registers. (Reminder to myself: with 65Org32, ZeroPage is the whole address space. This takes some getting used to!)

kc5tja wrote:
BigEd wrote:
Specifically on your point though - have you checked ARM? How about this:
Code:
str r0, [r1], r2

Move contents of r0 into address pointed to by r1, and increment r1 by contents of r2. *r1 = r0, r1 = r1 + r2


Similar, but different: the offset is not applied to the address but is added to the indirection for the next time around.


This is not ARM-specific. PowerPC and MIPS both have similar "load/store and update" instruction forms. It makes perfect sense to have them, for they actually make more efficient use of the instruction pipeline than a simple load/store instruction would by re-using the ALU's results. These forms are used to replace pre-decrement / post-increment addressing modes found on CISC processors like the 68K/Coldfire family.

However, since the update always occurs after the memory access, software will need to bias the pointers before entering the loop, depending on which direction through memory it wants to cycle through.

How does "load/store and update" re-use ALU results? I seem to be missing something.

-- Jeff


Posted: Thu Dec 01, 2011 7:02 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
I've a feeling what's really going on is that 6502 and other old-school micros are happy to do complex address calculations. Even in 65Org32, the [address],Y means
    fetch
    decode
    read from indirect address
    add Y
    read from effective address
    operate (can be pipelined)
(This is off the top of my head.) In the 6502 and similar, that third line is actually two reads, but then the addition of Y can be overlapped. I should really run this on visual6502, but if I've got this right it tells me that the 65Org32's state machine doesn't simplify quite the way I thought it did.

In RISC, one wishes to pipeline in an aggressive and also in a simple and regular way. So complex address calculations are out. Updating a pointer register and writing back is actually just part of an 'operate' cycle and doesn't pose any challenge to pipelining.
    fetch
    decode (is also usually pipelined)
    operate (can easily be pipelined in 6502 fashion)

(Operate is normally called execute. A write to memory is a slightly special case especially when it has to share the same bus as fetch, but usually it's handled by a subsystem. Again, this is thinking out loud.)

So, I suspect that Samuel's observation that modern RISC does without these addressing modes is not an observation about how they are not useful to an assembly language programmer, but how they are difficult for the (simple, fast) implementation.

And, Jeff, you're right to note that RISC offers lots of registers to use as pointers - we need our 128 pairs in 6502's zero page, or our 4 billion bytes in 65Org32's.

Cheers
Ed


Posted: Thu Dec 01, 2011 7:39 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
I've a feeling what's really going on is that 6502 and other old-school micros are happy to do complex address calculations. Even in 65Org32, the [address],Y means
    fetch
    decode
    read from indirect address
    add Y
    read from effective address
    operate (can be pipelined)
(This is off the top of my head.) In the 6502 and similar, that third line is actually two reads, but then the addition of Y can be overlapped. I should really run this on visual6502, but if I've got this right it tells me that the 65Org32's state machine doesn't simplify quite the way I thought it did.


Correct. In my core the following states are used:

FETCH - fetch next opcode, and perform prev ALU op
DECODE - IR is valid, decode instruction, and write prev reg
INDY0 - (ZP),Y - fetch ZP address, and send ZP to ALU (+1)
INDY1 - (ZP),Y - fetch at ZP+1, and send LSB to ALU (+Y)
INDY2 - (ZP),Y - fetch data, and send MSB to ALU (+Carry)
INDY3 - (ZP),Y - fetch data (if page boundary crossed)

For the 65Org32, you'd be able to eliminate state INDY3, but that state is normally only executed during page boundary crossings anyway, so it won't save much on typical programs.
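
A simplified Python model of the state list above may make the point concrete. It only tracks which states are visited for one (ZP),Y read; the function name and the `wide_core` flag are assumptions for this sketch, and the real core is of course a Verilog state machine.

```python
def indy_states(zp_pointer_low: int, y: int, wide_core: bool = False) -> list:
    """Return the states visited for one (ZP),Y read in this simplified model.

    zp_pointer_low is the low byte of the pointer stored in zero page.
    INDY3 is only visited when adding Y carries out of the low byte
    (a page boundary crossing), and never on a wide core where the
    pointer is a single word.
    """
    states = ["FETCH", "DECODE", "INDY0", "INDY1", "INDY2"]
    page_crossed = (zp_pointer_low & 0xFF) + (y & 0xFF) > 0xFF
    if page_crossed and not wide_core:
        states.append("INDY3")
    return states

print(indy_states(0x10, 0x20))   # no carry out of the low byte: five states
print(indy_states(0xF0, 0x20))   # carry out of the low byte: INDY3 is added
```

With `wide_core=True` the sequence never grows beyond five states, which is the only saving this model shows.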

You could remove a cycle by combining the add Y with the effective address read, by using a dedicated offset adder in the address generator:

Code:
INDY1: AB = DI + regfile


Of course, this may impact the max clock speed.


Posted: Thu Dec 01, 2011 10:08 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8517
Location: Southern California
On the 6502/816, reading the operand (ZP address) happens at the same time as decode. This is from the data sheet, address mode 7. Reading that next byte is pretty normal. It is also the reason for low-byte-first: If it needs an operand, it needs at least the low byte, but it doesn't know yet if it needs a high byte until the instruction is decoded. If there is an addition though, you start with the low byte anyway, and that operation can now be started while the processor is fetching the high byte.

So modifying what the data sheet says for the fact that we never have to fetch twice for an address, we have:

1. fetch the op code
2. fetch the operand (the first address), while decoding the op code
3. fetch the contents of the first address and add the index register to it, getting the final address
4. fetch the contents of the final address
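
The four steps above can be walked through in a short Python sketch. It assumes a word-addressed memory modelled as a dictionary, an instruction laid out as one opcode word followed by one operand word, and it reuses the 6502 opcode value $B1 purely as a placeholder; none of that is a settled 65Org32 detail.

```python
def load_indirect_y(memory: dict, pc: int, y: int):
    """Walk the four bus reads for a load through [address],Y.

    memory maps word addresses to 32-bit words.
    Returns (value, number_of_bus_reads).
    """
    mask = 0xFFFFFFFF
    reads = 0
    opcode = memory[pc]                   # 1. fetch the op code (not decoded here)
    reads += 1
    operand = memory[(pc + 1) & mask]     # 2. fetch the operand (the first address)
    reads += 1
    pointer = memory[operand]             # 3. fetch the contents of the first address...
    reads += 1
    final_address = (pointer + y) & mask  #    ...and add the index register to it
    value = memory[final_address]         # 4. fetch the contents of the final address
    reads += 1
    return value, reads

memory = {0x1000: 0xB1, 0x1001: 0x20, 0x20: 0x8000, 0x8003: 0xCAFE}
print(load_indirect_y(memory, 0x1000, 3))
```

The sketch counts memory accesses only; it says nothing about how many clock cycles the add in step 3 costs.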


Posted: Thu Dec 01, 2011 10:32 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
GARTHWILSON wrote:
On the 6502/816, reading the operand (ZP address) happens at the same time as decode.


In my core, it's similar, but there's an extra pipeline stage. Reading from memory takes a full cycle, so while the address bus is already set during the decode cycle, it takes until the INDY0 cycle before we can actually get the first operand.

In my comment, when I wrote 'fetch ZP address', it actually means: grab the ZP address from the DI bus. The memory access already started in the cycle before that.


Posted: Thu Dec 01, 2011 7:22 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
GARTHWILSON wrote:
On the 6502/816, reading the operand (ZP address) happens at the same time as decode.
Indeed. But we still need to fetch from ZP.
Quote:
1. fetch the op code
2. fetch the operand (the first address), while decoding the op code
3. fetch the contents of the first address and add the index register to it, getting the final address
4. fetch the contents of the final address
Note that step 3 takes two clock cycles: one to fetch, and one to add. You can't add to data which you don't yet have. (Maybe you were only intending to list the memory accesses, not the clock cycles.)

Although 65Org32 needs to do 32-bit arithmetic, I'd hope that wasn't too much of an impact on max clock frequency. I'll run a test and update.

For spartan3:
    8bit data/16bit address: 57.267MHz
    16bit data/32bit address: 54.457MHz
    32bit data/40bit address: 52.596MHz

Note that this isn't a 65Org32 (or 65Org40); it's just a quick check of whether the wide arithmetic is costly. For a quick unconstrained synthesis, it seems not.
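
To put the three figures side by side, this short Python snippet computes the relative drop in maximum clock against the 8-bit case. The numbers are the ones listed above; the helper name is made up.

```python
fmax_mhz = {
    "8-bit data / 16-bit address": 57.267,
    "16-bit data / 32-bit address": 54.457,
    "32-bit data / 40-bit address": 52.596,
}
baseline = fmax_mhz["8-bit data / 16-bit address"]

def slowdown_percent(fmax: float, reference: float = baseline) -> float:
    """Percentage drop in maximum clock frequency relative to the reference."""
    return (reference - fmax) / reference * 100.0

for name, fmax in fmax_mhz.items():
    print(f"{name}: {fmax:.3f} MHz, {slowdown_percent(fmax):.1f}% slower")
```

That works out to roughly 5% and 8% below the 8-bit figure for the two wider variants.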


Posted: Thu Dec 01, 2011 10:42 pm
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8517
Location: Southern California
Since I suspect the FPGA's internals can run a lot faster than an external bus of non-computer-scientist design and construction, I have no problem with having the input clock (kind of like phase 0) be two or more times the bus speed (like phase 2) if it helps do more steps per bus clock and eliminate dead bus cycles.


Posted: Fri Dec 02, 2011 12:23 am
Joined: Sun Nov 08, 2009 1:56 am
Posts: 397
Location: Minnesota
Quote:
Modifying JMP/JSR/RTS/RTI/BRK to only use 32 bit addressing would be a bit more complicated (and would also require changes in the assembler for JMP/JSR), but it would save a cycle, and would make the JMP/JSR instructions shorter.


Hmm. If I understand all this correctly, it seems to me every instruction on this proposed processor will be either one or two bytes. One if no memory access is needed, otherwise an opcode plus an address. Correct?

I suppose if all that's proposed is to implement the base 6502 instruction set, minus most three byte instructions and modifying any remaining to two bytes, that's reasonable enough. Not too difficult to implement, but goodbye to the kind of isomorphism that let me get away with using the same macro set to define instructions for both the 6502 and the 65Org16. C'est la vie.

But as long as hardware changes on this order are necessary anyway, has anyone considered pushing it even farther? How about a "true" zero page limited to the first 16MB? Eight bits of instruction, 24 bits of address. Opcode and address are both fetched in the same cycle, making "zero page" instructions both faster and shorter than "absolute" instructions, just like its ancestor. And there are still 16 million "registers", if you like to think of them that way.

Heck, someone might even want to put a hardware stack in the second 16MB :-)

An even more radical step would be to limit operations above the first 16MB (or 32MB) to loads and stores. No "INC abs", for instance ("INC zp", on the other hand, would be fine). That's quite RISC-y, as I understand the term.

Just tossing out ideas here. Do they bend the cores already out there too far to be used as bases?
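
For concreteness, the 8-bit opcode plus 24-bit address idea could be sketched like this in Python. Putting the opcode in the top eight bits of the word is an assumption of the sketch, as are the function names.

```python
ADDRESS_BITS = 24
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1   # 0xFFFFFF: the first 16M locations

def pack(opcode: int, address: int) -> int:
    """Pack an 8-bit opcode and a 24-bit address into one 32-bit word."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address is outside the 24-bit 'zero page'")
    return (opcode << ADDRESS_BITS) | address

def unpack(word: int):
    """Split a 32-bit instruction word back into (opcode, address)."""
    return (word >> ADDRESS_BITS) & 0xFF, word & ADDRESS_MASK

word = pack(0xA5, 0x123456)
print(hex(word), [hex(field) for field in unpack(word)])
```

An assembler following this scheme would fall back to the longer "absolute" form whenever `pack` would reject the address.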


Posted: Fri Dec 02, 2011 6:49 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
teamtempest wrote:
Quote:
Modifying JMP/JSR/RTS/RTI/BRK to only use 32 bit addressing would be a bit more complicated (and would also require changes in the assembler for JMP/JSR), but it would save a cycle, and would make the JMP/JSR instructions shorter.
Hmm. If I understand all this correctly, it seems to me every instruction on this proposed processor will be either one or two bytes. One if no memory access is needed, otherwise an opcode plus an address. Correct?

Yes, I think that sums it up.
Quote:
I suppose if all that's proposed is to implement the base 6502 instruction set, minus most three byte instructions and modifying any remaining to two bytes, that's reasonable enough. Not too difficult to implement, but goodbye to the kind of isomorphism that let me get away with using the same macro set to define instructions for both the 6502 and the 65Org16. C'est la vie.
I'm not sure about "minus most three byte instructions" - the instructions are there but the length is different. I think perhaps two interesting things happen:
    - detecting short addresses to choose zero page versus absolute will always choose zero page
    - so some of the absolute forms never get used. So they need not be implemented, which does leave room for alternate opcodes.
Looking at the opcode map, columns 4,5,6 mostly overshadow everything in columns C, D and E. The exceptions are
Code:
 4c JMP a
 6c JMP (a)
 94 STY d,X
 96 STX d,Y
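
The pairing can be checked mechanically. The Python sketch below uses a small hand-picked subset of the NMOS 6502 opcode map (not the full table) with the same "d" and "a" mode notation as the list above, and reports the entries that have no partner in the other column group.

```python
# A few NMOS 6502 opcodes, keyed by (mnemonic, mode). Deliberately not the full map.
OPCODES = {
    ("LDA", "d"): 0xA5, ("LDA", "a"): 0xAD,
    ("LDA", "d,X"): 0xB5, ("LDA", "a,X"): 0xBD,
    ("STA", "d"): 0x85, ("STA", "a"): 0x8D,
    ("LDX", "d,Y"): 0xB6, ("LDX", "a,Y"): 0xBE,
    ("INC", "d"): 0xE6, ("INC", "a"): 0xEE,
    ("STY", "d,X"): 0x94,   # there is no STY a,X
    ("STX", "d,Y"): 0x96,   # there is no STX a,Y
    ("JMP", "a"): 0x4C,     # there is no JMP d
    ("JMP", "(a)"): 0x6C,   # there is no JMP (d)
}

def counterpart(mode: str) -> str:
    """Map a zero-page mode to the absolute mode with the same indexing, and back."""
    return mode.replace("d", "a") if "d" in mode else mode.replace("a", "d")

def unpaired(opcodes: dict) -> list:
    """Return the entries that have no partner in the other column group."""
    return sorted(
        f"{code:02x} {mnemonic} {mode}"
        for (mnemonic, mode), code in opcodes.items()
        if (mnemonic, counterpart(mode)) not in opcodes
    )

print(unpaired(OPCODES))
# ['4c JMP a', '6c JMP (a)', '94 STY d,X', '96 STX d,Y']
```

In the paired entries of this subset, each absolute opcode is the zero-page opcode plus 8, which is why column 4/5/6 lines up with column C/D/E.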

As noted before, the target of a JMP (for example) is only one byte. So maybe the assembler sees absolute addresses as one byte, and the detection of short addresses falls away, and the overshadowing goes the other way?

Quote:
But as long as hardware changes on this order are necessary anyway, has anyone considered pushing it even farther?...

Interesting idea: all perfectly realisable. Two very small objections, or obstacles:
    - 65Org16 and 65Org32 are minimal deviations from an existing core, in order that they are implemented with minimal effort and minimal discussion about spec.
    - defining and using specific 16MB pages like that does mean that the physical memory map needs to populate at least some of those pages.
It might be worth a new thread for a 65Org24 (or other name) to see where to go with this one.

Cheers
Ed


Posted: Fri Dec 02, 2011 8:27 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
    8bit data/16bit address: 57.267MHz
    16bit data/32bit address: 54.457MHz
    32bit data/40bit address: 52.596MHz
Note that this isn't a 65Org32 (or 65Org40); it's just a quick check of whether the wide arithmetic is costly. For a quick unconstrained synthesis, it seems not.


One concern: bit shifting up to 32 bits becomes quite cumbersome with only single-bit shift instructions, and because there are no 8 bit units (octets), you can't take any shortcuts by shifting per octet either.

So, it seems an extension for a barrel shift instruction would be quite useful, at the cost of slowing down the core by quite a bit (although you could make it an N-clock-cycle instruction).


Posted: Fri Dec 02, 2011 8:51 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
I wonder if a barrel shifter would be slow - it's a giant multiplexing structure. There's one way to find out!
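
As a software picture of that multiplexing structure, here is a Python model of one common arrangement, a logarithmic barrel shifter with five stages for a 32-bit word. Whether a core would actually be built this way is an assumption; the sketch only shows that five 2-to-1 selections cover every shift distance from 0 to 31.

```python
MASK32 = 0xFFFFFFFF

def barrel_shift_left(value: int, amount: int) -> int:
    """Shift a 32-bit value left by 0..31 using five 2-to-1 multiplexer stages.

    Stage k either passes its input through or shifts it by 2**k,
    selected by bit k of the shift amount.
    """
    if not 0 <= amount <= 31:
        raise ValueError("amount must be in 0..31")
    result = value & MASK32
    for k in range(5):                   # stages shift by 1, 2, 4, 8, 16
        if (amount >> k) & 1:
            result = (result << (1 << k)) & MASK32
    return result

print(hex(barrel_shift_left(0x12345678, 20)))
```

The delay through such a structure grows with the number of stages rather than with the shift distance, which is the part that would need measuring on a real FPGA.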

On the other hand, a couple of 8-bit rotate instructions would speed up long distance shifts (I'm thinking of rotating the 32-bit A by 8 bits, ignoring carry)
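
A quick Python sketch of how such a rotate could shorten a long shift. The `rol8` operation is hypothetical (no such instruction exists in the cores discussed), and the step count treats each rotate, AND, and single-bit shift as one instruction.

```python
MASK32 = 0xFFFFFFFF

def rol8(value: int) -> int:
    """Hypothetical instruction: rotate a 32-bit value left by 8 bits, ignoring carry."""
    value &= MASK32
    return ((value << 8) | (value >> 24)) & MASK32

def asl(value: int) -> int:
    """Single-bit shift left; the carry out is simply dropped in this model."""
    return (value << 1) & MASK32

def shift_left_20(value: int):
    """Shift left by 20 using two byte rotates, one AND, and four single shifts."""
    value = rol8(rol8(value))   # rotate left by 16 in two steps
    value &= 0xFFFF0000         # clear the bits that wrapped around
    for _ in range(4):          # the remaining four bit positions
        value = asl(value)
    steps = 2 + 1 + 4
    return value, steps

print(shift_left_20(0x12345678))
```

That is seven operations in place of twenty single-bit shifts for the same result.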

Speaking of ignoring carry, we would need to consider actual use cases. Porting some real code fragments might illuminate what's needed.

With a macro assembler, a sequence of some tens of single-bit shifts isn't so ugly, it just takes a few cycles: this might never matter for performance. When doing real work like multiply or CRC, each single shift is followed by some action.

Cheers
Ed


Posted: Fri Dec 02, 2011 8:55 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Quote:
With a macro assembler, a sequence of some tens of single-bit shifts isn't so ugly


Until you realize you've just spent 4% of a precious block RAM on a single 20 bit shift. :)
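
The arithmetic behind that figure, as a Python sketch under two assumptions that are not stated in the thread: each single-bit shift is a one-word instruction, and one block RAM holds 512 instruction words (an 18 Kbit block used as 512 words of 32 data bits plus parity).

```python
BLOCK_RAM_WORDS = 512   # assumed: one 18 Kbit block RAM organised as 512 words
WORDS_PER_SHIFT = 1     # assumed: a single-bit shift of A is a one-word instruction

def block_ram_percent(shift_count: int) -> float:
    """Share of one block RAM consumed by an unrolled run of single-bit shifts."""
    return shift_count * WORDS_PER_SHIFT / BLOCK_RAM_WORDS * 100.0

print(f"{block_ram_percent(20):.1f}%")   # 20 / 512, about 3.9%
```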


Posted: Fri Dec 02, 2011 9:05 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8517
Location: Southern California
My original 65Org32 proposal (it would be good to review the topic) was that it would be like the '816 in that it has the (32-bit) offset registers and the extra instructions, but it would have no emulation mode, no page or bank boundaries, no address bus multiplexing, and no 3- or 4-byte instructions. It would have the barrel shifter, and it ought to have a MULtiply instruction if not also a DIVide. I can't get excited about a 32-bit NMOS 6502, although I know that would be easiest to do, just extending the widths on the Verilog 6502 models. The other extreme is to go to a ton of registers, deep pipelining, branch prediction, onboard cache, etc., and end up with something that has little to no resemblance to the 6502 or '816, like the 65GZ032 project did, which, after a lot of progress and even some working hardware, still fizzled out before it was done.

I might still see having direct-page, absolute, and "long" addressing though, all three, even though they all cover the same 4 gigaword address range; because sometimes you want the DP offset so you use the DP addressing, sometimes the DBR or PBR offset so you use absolute addressing, and sometimes no offset so you use "long" addressing (although it's no longer than the others-- it just ignores the offsets).
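
A minimal Python sketch of the three modes as described, where the only difference is which offset register (if any) is added to the 32-bit operand. The mode names, the function, and the wrap-around at 2^32 are assumptions of the sketch, not a specification.

```python
MASK32 = 0xFFFFFFFF

def effective_address(operand: int, mode: str, dp: int = 0, dbr: int = 0) -> int:
    """Form a 32-bit effective address from a 32-bit operand.

    "dp"   adds the direct-page offset register,
    "abs"  adds the data bank offset register,
    "long" uses the operand as-is, ignoring both offsets.
    """
    offsets = {"dp": dp, "abs": dbr, "long": 0}
    if mode not in offsets:
        raise ValueError(f"unknown mode: {mode}")
    return (operand + offsets[mode]) & MASK32

for mode in ("dp", "abs", "long"):
    print(mode, hex(effective_address(0x10, mode, dp=0x2000, dbr=0x40000)))
```

All three forms take an operand of the same width; the opcode alone selects the offset.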

I personally don't expect to ever address more than 16 megawords of address space; but being able to handle 32 bits at a time is important in higher-level languages; and in Forth, any cell might be an address, data, an index value, etc.. Although merging 8-bit op codes with 24-bit addresses would save some memory and bus cycles, I suspect it will increase internal complexity, perhaps reduce maximum clock speed, and sometimes even make for more-difficult programming.

