6502 or ARM core selection?

GARTHWILSON · Post by **GARTHWILSON** » Sat Aug 13, 2011 11:50 pm

Quote:

Don't forget the above sequence would also include REP #%00100000 to enable 16 bit loads and stores and SEP #%00100000 to revert to 8 bit mode.

Actually no, the '816 is more efficient than that, because in Forth, you leave the accumulator in 16-bit mode and the index registers in 8-bit mode almost full-time. There are very few REPs and SEPs in my '816 Forth kernel. I know people are afraid of those REPs and SEPs, but they're a rare occurrence in this application.

GARTHWILSON · Post by **GARTHWILSON** » Sun Aug 14, 2011 12:23 am

BigDumbDinosaur wrote:

The 65C816 has a number of useful stack instructions that are almost tailor-made for languages such as Forth. For example, you can push an address to the stack and then read the contents of that address without touching any zero page memory. First, the 65(c)02 way:

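          LDX #<addr            ;low byte of address
          LDY #>addr            ;high byte of address
          STX zpptr             ;build the pointer in zero page
          STY zpptr+1
          LDY #0                ;index
          LDA (zpptr),y         ;grab byte through the ZP pointer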

Now, the 65C816 way:

          REP #%00100000        ;select 16 bit .A
          LDA #addr             ;load full address in one operation
          PHA                   ;put it on the stack
          SEP #%00100000        ;select 8 bit .A
          LDY #0                ;index
          LDA (1,s),y           ;grab byte like LDA (zpaddr),y

You can go further than that and:

push a literal on the hardware stack in a single instruction with PEA which is "push 16-bit immediate data."
If you want to fetch the 16-bit number contained at that address and push it on the hardware stack, use PEI.
PER pushes onto the stack a 16-bit address relative to where the PC is at the moment, with the offset being specified in the operand. This is helpful in writing self-relocatable code.

None of these affect the flags or the accumulator, and they are always 16-bit regardless of the M or X flags, so REP and SEP are never necessary.

So how I would do the above would be:

          PEA addr   ; (The # is not used for PEA, even though it's always immediate.)
          LDA (1,s),y

NEXT left Y at 0 anyway, so there's no need to re-zero it, and we almost always deal in 16-bit cells, so I just leave the M and X flags alone.

The above is only if you want to leave that address on the return stack though. Otherwise we're back to just LDA addr, a single instruction for the whole operation.
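
For anyone who doesn't have the '816 addressing modes memorized, here is a small Python model (not '816 code) of what PEA followed by LDA (1,s),y does. It is only a sketch under assumptions: bank 0 only, a 16-bit accumulator, Y already 0, and made-up names (`mem`, `pea`, `lda_sr_ind_y`, `ADDR`) that do not come from any real emulator.

```python
# Toy model of the 65C816 "(d,S),Y" addressing mode, bank 0 only.
# All names here are illustrative, not taken from any real emulator.

mem = bytearray(0x10000)   # 64 KB of memory
S = 0x01FF                 # stack pointer: points at the next free byte
Y = 0                      # index register (NEXT is assumed to leave it at 0)

def pea(value):
    """PEA: push a 16-bit value, high byte first, so the low byte ends up at S+1."""
    global S
    mem[S] = (value >> 8) & 0xFF
    mem[S - 1] = value & 0xFF
    S -= 2

def lda_sr_ind_y(d):
    """LDA (d,S),Y with a 16-bit accumulator: pointer at S+d, plus Y, read one cell."""
    ptr = mem[S + d] | (mem[S + d + 1] << 8)
    ea = (ptr + Y) & 0xFFFF
    return mem[ea] | (mem[(ea + 1) & 0xFFFF] << 8)

ADDR = 0x2000              # hypothetical address of a Forth cell
mem[ADDR] = 0x34           # cell holds 0x1234, little-endian
mem[ADDR + 1] = 0x12

pea(ADDR)                  # PEA addr
value = lda_sr_ind_y(1)    # LDA (1,s),y
print(hex(value))
```

The pointer stays on the stack afterwards, which is the "leave that address on the return stack" case; nothing in zero page is touched.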

Quote:

One of the handy things about this sort of stack acrobatics is the relative ease with which fully relocatable code can be developed. However, the 65C816 offers another approach and that is the ability to relocate zero page. That feature makes it practical to give each subroutine its own zero page (and stack, which can also be relocated). When the subroutine has finished, put ZP and the stack back where they used to be and go on your way.

Having the data stack (not the return stack) in ZP, "subroutines" (which could be anything in Forth) never step on the variables of other pending routines (that's the nature of a stack), so I never had any reason to change the DP register from zero in my Forth kernel. The best use of changing the DP register appears to be true multitasking, so every independent task can have its own "zero page." I have managed to do all the multitasking I need with interrupts, including servicing interrupts in high-level Forth with no overhead as my article describes, so I have not gotten into true multitasking. I call it "pseudomultitasking" for lack of a better term. The interrupt performance of the 65xx processors allows this to be practical.

Arlet · Post by **Arlet** » Sun Aug 14, 2011 5:36 am

GARTHWILSON wrote:

Is that 12 actual clocks (pulses)? Most interrupt handling should not require saving all the registers, and with a hardware-complexity penalty (or actually not if you do it in an FPGA), you can have the interrupt put the right address in the vector location so the correct handler is jumped to directly.

Yes, 12 actual clocks (assuming zero wait-state memory). It doesn't save all registers, though. Only a handful of registers are saved, which allows you to use an interrupt routine written in C (or another high level language). If you need more registers, the function has to save them itself, like any other function.

The idea is that you can write a standard C function, and plug the address of the function into the vector table, and it will be called when that interrupt occurs, using a standard calling context. All the normal 'scratch' registers that a function can use are saved by the interrupt hardware. I think there are 5 of those. The interrupt handler also allows 32 programmable priorities and automatic nesting of interrupts. On a typical controller, you'll have a few dozen different interrupt sources.

Also, the Cortex has an optimization when two interrupts overlap. As soon as the first one is handled, it jumps to the second one, skipping the part where it restores and saves the registers.

sixtyfive02 · Post by **sixtyfive02** » Sun Aug 14, 2011 1:31 pm

Here are some ideas:

lots of registers - avoids spilling to/from zero page a lot
memory bus width - 4x difference
cache - allows for harvard-like architecture internally, overlap of data and instruction activity
predication - avoids a branch penalty in many cases
more powerful instructions and addressing modes - get more work done for each instruction

------
I had a look at Samuel's tables. It is strange to see (in the first table) that the Z80 performs the same as the 6502 (in the literature the 6502 is claimed to be more efficient). Anyway, let's take 0.02 DMIPS/MHz for the 6502.

Given that, I tried to work out a different metric that also takes the gate count into account, since gate count is fundamental for an FPGA user. I defined DMIPS/kgates (strictly, DMIPS/MHz per kgate) instead of DMIPS/MHz. From the internet I found: ARM7TDMI = 48 kgates and 0.74 DMIPS/MHz in Thumb mode (I'm interested in code density). For the 6502 I took 3.5 kgates (also from the internet) and 0.02 DMIPS/MHz.

With these numbers we get 0.015 DMIPS/kgates for the ARM7 and 0.006 DMIPS/kgates for the 6502. I think these figures better represent the architecture's efficiency when gate count is important. Given these numbers, the ratio between the 6502 and the ARM7 is about 1:37 in DMIPS/MHz, and it drops to about 1:3 in DMIPS/kgates. Not bad, considering that the ARM7 manipulates data 32 bits at a time!
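
The arithmetic is easy to recheck. The short Python sketch below just recomputes the per-kgate figures and the two ratios from the inputs quoted in this post; the variable names are mine, and the inputs themselves are the unverified "from the internet" numbers above.

```python
# Inputs are the figures quoted in the post; variable names are illustrative.
arm7_dmips_per_mhz = 0.74    # ARM7TDMI, Thumb mode
arm7_kgates = 48.0
m6502_dmips_per_mhz = 0.02
m6502_kgates = 3.5

# Performance per MHz, normalised by gate count.
arm7_per_kgate = arm7_dmips_per_mhz / arm7_kgates
m6502_per_kgate = m6502_dmips_per_mhz / m6502_kgates

# How far ahead the ARM7 is under each metric.
speed_ratio = arm7_dmips_per_mhz / m6502_dmips_per_mhz
density_ratio = arm7_per_kgate / m6502_per_kgate

print(f"ARM7: {arm7_per_kgate:.4f} DMIPS/MHz per kgate")
print(f"6502: {m6502_per_kgate:.4f} DMIPS/MHz per kgate")
print(f"ARM7 : 6502 per MHz           = {speed_ratio:.1f} : 1")
print(f"ARM7 : 6502 per MHz per kgate = {density_ratio:.1f} : 1")
```

Changing either gate count (FPGA implementations vary a lot) moves the second ratio directly, so it is worth rerunning with figures for the specific cores being compared.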

So I think this is a fairer way to compare the cores when looking for a compromise between efficiency and gate count. Maybe with a 32-bit 6502 the gap could be reduced even further.
I hope this makes sense.

I'm still looking for code density statistics for the 6502, but I haven't found anything on the internet so far.

Cheers

BigEd · Post by **BigEd** » Sun Aug 14, 2011 2:14 pm

This topic is now the top hit for 6502 code density, so: search for a pdf titled "Code Density Concerns for New Architectures" and you should find a diagram like this:

[Image: code density diagram] From http://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_density.pdf

(There are more breakdowns for different elements of the benchmark - all the sources are here and there are comments on the architectures here)

(Fair use exception to copyright, of course. Credit to the authors: Vincent M. Weaver and Sally A. McKee)

Cheers
Ed

Edit: replaced lost image with new screenshot

BigEd · Post by **BigEd** » Sun Aug 14, 2011 2:42 pm

To answer some more points:

sixtyfive02 wrote:

I also found the 65GZ032. Any comments on this? DMIPS? Code density? Is the code available for free?

For this, and for pointers to some other variations on the 6502 theme, and for André's own 65k project, have a look at his pages here. His project allows for 32bit, and more besides.

On the question of 6502 gate count, there's a thread here with a summary of 10 implementations of 6502 for FPGA.

On this point:

GARTHWILSON wrote:

Electric_Eye here is working on a 16/32-bit version of the 6502 in an FPGA. My proposal for the 65Org32, an all-32-bit 6502, is described in this lengthy (9-page) topic.

I feel I should clarify: the 9-page topic was the latest of several brainstorms which covered all sorts of enhancement ideas for 6502. It resulted in a couple of specific ideas which seemed simple and attractive: the 16 and 32 bit versions which Garth mentions.

I coded up the 16-bit version, using Arlet's 6502 core as a base. I've implemented that core in an FPGA module, and TeamTempest and BitWise have both ported their 6502 assemblers to target the core. Electric_Eye has taken a fork of the core and started work on a spartan6 development board. I don't believe it would be much work to create the 32-bit variant (which would actually be simpler.)

sixtyfive02 wrote:

You seem to be suggesting that I go for a 65816. Is there any free RTL code?

No, I don't think 65816 has been tackled: there's a fair chance it's protected.

Cheers
Ed

BigDumbDinosaur · Post by **BigDumbDinosaur** » Mon Aug 15, 2011 3:22 pm

GARTHWILSON wrote:

You can go further than that and:

push a literal on the hardware stack in a single instruction with PEA which is "push 16-bit immediate data."
If you want to fetch the 16-bit number contained at that address and push it on the hardware stack, use PEI.
PER pushes onto the stack a 16-bit address relative to where the PC is at the moment, with the offset being specified in the operand. This is helpful in writing self-relocatable code.

PEA pushes its operand to the stack, which is fine if the address (which it might not be—the operand could be anything, including a simple 16 bit constant) is static and known at assembly time, unless using self-modifying code. PEA isn't of any particular value if double-indirection is involved.

Consider the case where a subroutine is to print a text string selected by an index value in one of the registers. The first step (after validating the index value) would be to get the appropriate string pointers from a look-up table. With the 65(c)02, you'd store the pointers into a ZP location. With the '816, you could push them to the stack and access the string with LDA (1,S),Y. PEA could push the starting address of the look-up table to the stack, but you'd still be faced with loading the actual pointers into a register in order to prime the stack with the starting address of the string. Hence my code example.
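
To illustrate the point about run-time pointers, here is a toy Python model (not '816 code) of that string-printing case. Everything in it is an assumption made for the sketch: the table address, the string addresses, zero-terminated strings, bank 0 only, and the helper names. It only shows that the pointer has to come out of the table at run time before it can be pushed and used through (1,S),Y.

```python
# Toy model: pick a string through a pointer table, then read it via (1,S),Y.
# Addresses, table layout and helper names are all made up for illustration.

mem = bytearray(0x10000)    # bank 0 only
S = 0x01FF                  # stack pointer, points at the next free byte

TABLE = 0x3000              # hypothetical table of 16-bit string pointers

def poke_string(addr, text):
    """Store a zero-terminated ASCII string at addr."""
    data = text.encode("ascii") + b"\x00"
    mem[addr:addr + len(data)] = data

def read16(addr):
    """Read a little-endian 16-bit word."""
    return mem[addr] | (mem[addr + 1] << 8)

def push16(value):
    """PHA with a 16-bit accumulator: high byte first, low byte ends up at S+1."""
    global S
    mem[S] = (value >> 8) & 0xFF
    mem[S - 1] = value & 0xFF
    S -= 2

def fetch_string(index):
    """Look up string number `index` and collect its bytes through the stacked pointer."""
    global S
    push16(read16(TABLE + 2 * index))     # pointer is only known at run time
    chars = []
    y = 0
    while True:
        byte = mem[(read16(S + 1) + y) & 0xFFFF]   # LDA (1,S),Y with 8-bit A
        if byte == 0:
            break
        chars.append(chr(byte))
        y += 1
    S += 2                                # drop the pointer again
    return "".join(chars)

poke_string(0x4000, "READY")
poke_string(0x4010, "ERROR")
mem[TABLE:TABLE + 4] = bytes([0x00, 0x40, 0x10, 0x40])   # little-endian pointers

print(fetch_string(1))
```

A PEA in place of `push16(read16(...))` could only push a value fixed at assembly time, such as `TABLE` itself, which is the limitation being described.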

bogax · Post by **bogax** » Mon Aug 15, 2011 5:51 pm

BigEd wrote:

Here are some ideas:

lots of registers - avoids spilling to/from zero page a lot
memory bus width - 4x difference
cache - allows for harvard-like architecture internally, overlap of data and instruction activity
predication - avoids a branch penalty in many cases
more powerful instructions and addressing modes - get more work done for each instruction

Ed

I'm not sure this is pertinent to this thread, and I've never tried to work out the details to know whether it really makes sense.

Keep two copies of memory and make writes go to both but allow reads in parallel. (so you get something like Harvard and possibly get some of the advantage of registers for zp)

Dedicate a couple of bits in the instruction to jumps/subroutines so that every instruction is (potentially) a jump or call to a subroutine or a return. (so you can do something like microcoding without so much subroutine time penalty or macro space penalty) (wouldn't necessarily have to be every instruction)

BigEd · Post by **BigEd** » Wed Aug 17, 2011 7:25 pm

bogax wrote:

Keep two copies of memory and make writes go to both but allow reads in parallel. (so you get something like Harvard and possibly get some of the advantage of registers for zp)

Yes, I could see that working - like a pair of mirrored disks. Thing is, everything is a tradeoff of cost/benefit, and this has cost a second memory interface and a second memory. Probably the pin count would be the thing I'd worry about most. An on-FPGA cache would have a similar effect, I think? And one could dedicate some part of that to zero page or stack, if it turned out worthwhile.
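
As a way of pinning down what the mirrored scheme buys and costs, here is a tiny behavioural model in Python. It is only a sketch: the class and method names are invented, and it ignores timing, pin count and everything else that would decide whether the idea is worthwhile in hardware.

```python
class MirroredMemory:
    """Two identical RAM copies: every write hits both, reads can be split."""

    def __init__(self, size):
        self.copy_a = bytearray(size)
        self.copy_b = bytearray(size)

    def write(self, addr, value):
        # The cost side: each write occupies both memories.
        self.copy_a[addr] = value & 0xFF
        self.copy_b[addr] = value & 0xFF

    def read_pair(self, addr_a, addr_b):
        # The benefit side: two unrelated reads in the same cycle.
        return self.copy_a[addr_a], self.copy_b[addr_b]

ram = MirroredMemory(0x10000)
ram.write(0x0200, 0xA5)     # an instruction byte (opcode for LDA zp)
ram.write(0x0010, 0x42)     # a zero-page variable

opcode, operand = ram.read_pair(0x0200, 0x0010)
print(hex(opcode), hex(operand))
```

An on-chip cache would present the same two-read interface to the core while keeping a single external memory.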

bogax wrote:

Dedicate a couple of bits in the instruction to jumps/subroutines so that every instruction is (potentially) a jump or call to a subroutine or a return. (so you can do something like microcoding without so much subroutine time penalty or macro space penalty) (wouldn't necessarily have to be every instruction)

Not sure what this idea is: if you dedicate one or two bits to flag a jump or jsr, where do you get the destination from? Another operand I suppose? So this is an idea for packing a pair of operations into a single instruction fetch and decode... a bit like predication, but in that case there's no need for an extra operand. The thing about this idea is that jmp/jsr may not be so common as to make this much of a saving. Common actions like increment/decrement, so you get compound actions like post or pre increment/decrement, are perhaps more likely to earn their keep?

Cheers
Ed

sixtyfive02 · Post by **sixtyfive02** » Fri Aug 19, 2011 6:47 pm

Quote:

You haven't said what you want to do with it. The 6502 was not intended for multitasking and large applications compiled from high-level languages (which is what Dhrystone is all about); but OTOH, it was simple enough that writing hand-optimized assembly was far more practical on the 6502 than it is on most modern processors. IOW, Dhrystone MIPS may or may not be relevant, depending on what you want to do with it.

In my case, I would like to build a platform in an FPGA. The platform should be generic enough to develop firmware for simple systems that perform industrial control (including control of an LCD) as well as simple DSP algorithms, so it is a low-demand platform.
I would prefer to invest in the 6502 (that is in fact what I would like to do, since I love this processor for its simplicity), but I don't want to discover after a few months that I need more DMIPS to do something that I can't foresee today. With an ARM7 clone I could even reuse AHB peripherals that I can find on the internet, speeding up the assembly and verification of the FPGA.

BTW, I think I understand your advice: the 6502 is better for simple applications where hand-written code is required. The 65816 would be better, but no clones are available today in RTL form.

So the major doubt remains the DMIPS that I will need for the specific application, but I could probably mitigate this potential problem by putting two 6502 cores into the FPGA. The gate count is so small that I could fit at least a couple of them.

Thank you!

sixtyfive02 · Post by **sixtyfive02** » Fri Aug 19, 2011 7:17 pm

Quote:

For this, and for pointers to some other variations on the 6502 theme, and for André's own 65k project, have a look at his pages here. His project allows for 32bit, and more besides.

A really interesting project, congratulations to André! Anyway, I think it will need tens and tens of man-years to complete this roadmap... we shall have to wait.

Quote:

On the question of 6502 gate count, there's a thread here with a summary of 10 implementations of 6502 for FPGA.

Thank you Ed, really useful. In your experience, are all these cores mature enough to be used, or is there a version that you would suggest as more stable than the others?

Quote:

I coded up the 16-bit version, using Arlet's 6502 core as a base. I've implemented that core in an FPGA module, and TeamTempest and BitWise have both ported their 6502 assemblers to target the core.

Congratulations on this working 16-bit derivative!

Cheers

BigEd · Post by **BigEd** » Fri Aug 19, 2011 9:12 pm

sixtyfive02 wrote:

Quote:

On the question of 6502 gate count, there's a thread here with a summary of 10 implementations of 6502 for FPGA.

Thank you Ed, really useful. In your experience, are all these cores mature enough to be used, or is there a version that you would suggest as more stable than the others?

I don't know if they are all thoroughly excellent - one might have to do some additional research. The T65, Syntiac, and retromaster cores have been used in games, which I think makes them pretty thoroughly exercised. I have a lot of faith in Arlet's and I gather he's used it in a real-world project. (I also have a strong preference for Verilog.)

(The ones I don't mention in the above paragraph might also be splendid - I don't mean anything by not naming them.)

There's a test suite by Wolfgang Lorenz - it's a good sign to pass that - but correctly responding to RDY and interrupts in all cycles of all instructions is another level of detail. Having precise timing and the right bus activity in each cycle is necessary for games. I collected some test suite info in a wiki page.

Quote:

Congratulations on this working 16-bit derivative!

Thanks! (Shoulders of giants...) (Edit: both for the core and the emulator, I only changed a couple hundred lines of code in pre-existing 6502 projects, mostly trivial edits to allow for 16-bit bytes.)

Cheers
Ed

BigEd · Post by **BigEd** » Sat Aug 20, 2011 1:37 pm

I just posted elsewhere a link to an article with a few well written paragraphs on ARM's feature set, performance and code density.

fachat · Post by **fachat** » Wed Aug 24, 2011 8:35 pm

GARTHWILSON wrote:

BigDumbDinosaur wrote:

The 65C816 has a number of useful stack instructions that are almost tailor-made for languages such as Forth. For example, you can push an address to the stack and then read the contents of that address without touching any zero page memory. First, the 65(c)02 way:

          LDX #<addr
          LDY #>addr
          STX zpptr
          STY zpptr+1
          LDY #0
          LDA (zpptr),y

Now, the 65C816 way:

          REP #%00100000        ;select 16 bit .A
          LDA #addr             ;load full address in one operation
          PHA                   ;put it on the stack
          SEP #%00100000        ;select 8 bit .A
          LDY #0                ;index
          LDA (1,s),y           ;grab byte like LDA (zpaddr),y

Just to add my 2 cents :-)

The 65k way with its two new registers E and B could be something like:

     LEA addr,Y            ; load register E with address "addr" and Y added
     LDA (E)

Another alternative would be something like:

     LDA.W #addr        ; word load addr into AC, default is to zero-extend
     TAB                ; full width transfer to B
     LDA (B,0),Y        ; use B as base address

... which made me think that actually an opcode "LDB", load base register is missing.

André

BigEd · Post by **BigEd** » Sat Aug 27, 2011 1:07 pm

BigEd wrote:

Thanks for finding the 6502EX - I was unaware of that(*). It's a 32-bit extension, with an 8-bit mode, and lots of registers and operations on 4 byte lanes and some remnant of 64k banks in 32bit mode. (But it seems to be closed source and presumably intended to be licensed for money, so not of direct interest to me.)

sixtyfive02 wrote:

3. No idea if I'm the first to visit 6502EX. For sure this site is new; it has been updated recently. I guess that the code will be available only for payment, but it's not clear. In that case I'm not interested either.

I notice an indication that this CPU might in fact end up open-sourced, which would be great!

Quote:

The 6502EX is a 32bit extension of the popular 6502 8bit processor. 6502EX source code is available under GNU GPL license

Cheers
Ed
