
6502 or ARM core selection?

Posted: Fri Aug 12, 2011 1:00 pm
by sixtyfive02
Hi all,

I'm new to this forum. I found it really interesting because it probably collects the most useful and complete information on the 6502... and probably the most expert users.

I need to select a processor for undemanding general-purpose applications (control, DSP-like work on 8 bits, etc.). I would like your opinion on the best choice between a 6502-compatible core and an ARM7TDMI-compatible core (or similar). The ARM7 certainly has higher performance, but my priorities are the most compact code (this is a low-cost embedded application) and the lowest gate count. Do you have any general advice or guidelines for comparing these two cores? For example, it would be useful to know the equivalent DMIPS/MHz for the 6502, because I couldn't find it on the internet. It would also help to know whether the 6502 beats the ARM7 in code density. With this data I could try to work out the tradeoffs myself.

I found some 6502- and ARM7-compatible cores on the internet and would appreciate advice on the selection criteria. For the 6502, for example, I found cores like bc6502 and T65, which are claimed to be working. I also found a recent 6502-compatible core called 6502EX at www.ex6502.altervista.org. It seems to be a 6502 extended to 32 bits, so probably something in between the original 6502 and the ARM7, but I'm not sure whether it is available.

Thanks in advance to all of you who give me feedback to help with this selection.

Best Regards

Posted: Fri Aug 12, 2011 6:58 pm
by BigEd
Welcome!

Samuel found a table showing the 6502 (actually the 6510, but that's the same thing) as having a Dhrystone score of 32 or 36 at 1 MHz (and therefore about 0.02 DMIPS/MHz)
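The conversion from a Dhrystone score to DMIPS is just a division by the VAX 11/780's score of 1757 Dhrystones per second, the machine that defines "1 MIPS". A minimal sketch, assuming the 32 and 36 figures are Dhrystones per second measured at 1 MHz:

```python
# DMIPS are defined relative to the VAX 11/780, which scored 1757 Dhrystones/second
# and is treated as a "1 MIPS" machine.
VAX_11_780_DHRYSTONES_PER_SEC = 1757

def dmips_per_mhz(dhrystones_per_sec: float, clock_mhz: float) -> float:
    """Convert a Dhrystone score measured at clock_mhz into DMIPS per MHz."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC / clock_mhz

for score in (32, 36):  # the 6510 scores from the table, assumed measured at 1 MHz
    print(f"{score} Dhrystones/s at 1 MHz -> {dmips_per_mhz(score, 1.0):.4f} DMIPS/MHz")
```

Scaling by clock is then a multiplication: at 111MHz the same ratio works out to roughly 2.2 DMIPS.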

Thanks for finding the 6502EX - I was unaware of that(*). It's a 32-bit extension with an 8-bit mode, lots of registers, operations on four byte lanes, and some remnant of 64K banks in 32-bit mode. (But it seems to be closed source and presumably intended to be licensed for money, so not of direct interest to me.)

What ARM-compatible cores have you found? ARM are rather protective of their patents.

Cheers
Ed

Edit to add: for code density, see the paper "Code Density Concerns for New Architectures" by Vincent M. Weaver and Sally A. McKee

(*) But it does seem to have appeared only within the last month or so and is new to Google too. Are you the first person to find it?

Edit: added conversion between Dhrystone score and DMIPS

Posted: Fri Aug 12, 2011 7:24 pm
by GARTHWILSON
Welcome.
Quote:
For example, it could be useful for me to know the equivalent DMIPS/Mhz for the 6502 because I didn'd find it on internet.
The average will be about four clocks per instruction, IOW 1 MIPS @ 4MHz, 4 MIPS @ 16MHz, etc. What beginners often don't realize is that certain operations are merged into one instruction, so they write redundant code like an addition followed by a comparison of the result to 0, which the addition has already done. ADC# for example does its five steps in only two clocks, and that includes the implicit CMP#0.
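The redundancy is easy to see in a small model of ADC. This Python sketch covers binary mode only (decimal mode is ignored) and checks that the N and Z flags ADC leaves behind are exactly the ones a following CMP #0 would set; CMP #0 differs only in forcing C on, which is usually not what you are testing after an add.

```python
def adc(a: int, operand: int, carry_in: int) -> tuple[int, dict[str, bool]]:
    """Binary-mode (D=0) 6502 ADC: returns the new A and the N, V, Z, C flags."""
    total = a + operand + carry_in
    result = total & 0xFF
    flags = {
        "N": bool(result & 0x80),                         # bit 7 of the result
        "V": bool(~(a ^ operand) & (a ^ result) & 0x80),  # signed overflow
        "Z": result == 0,                                 # result is zero
        "C": total > 0xFF,                                # carry out of bit 7
    }
    return result, flags

def cmp_zero(a: int) -> dict[str, bool]:
    """Flags CMP #0 would set: A - 0 == A, so N and Z follow A, and C is always set."""
    return {"N": bool(a & 0x80), "Z": a == 0, "C": True}

# After any ADC, N and Z already match what a following CMP #0 would report.
for a in range(256):
    for operand in (0x00, 0x01, 0x7F, 0x80, 0xFF):
        result, flags = adc(a, operand, 0)
        after_cmp = cmp_zero(result)
        assert flags["N"] == after_cmp["N"] and flags["Z"] == after_cmp["Z"]
```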

I have brought several products to market with PIC16 microcontrollers and can tell you the PIC takes about twice as many instructions and twice as many clocks to do a job as the 6502 does-- if the PIC can do it at all.

Where the 6502 shines most is interrupt performance. It is not unrealistic for a 20MHz 6502 to service a million interrupts per second, although that means a very simple ISR that only increments a byte in memory, for example. (See my interrupts article. I also have one on servicing interrupts in high-level Forth, with zero overhead.)
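The million-per-second figure can be checked with simple clock counting. A rough sketch, assuming the ISR is nothing but INC zp followed by RTI, and using the standard 6502 counts of 7 clocks for the hardware interrupt sequence, 5 for INC zp and 6 for RTI; the optional latency term stands for clocks spent finishing the instruction that was executing when the interrupt arrived.

```python
# Standard 6502/65C02 clock counts (none of these incur page-crossing penalties).
IRQ_SEQUENCE = 7  # hardware interrupt sequence: push PC and P, fetch the vector
INC_ZP = 5        # INC zp: the entire ISR body in this example
RTI = 6           # return from interrupt

def max_interrupts_per_second(clock_hz: float, isr_body_clocks: int,
                              avg_latency_clocks: float = 0) -> float:
    """Rough upper bound if the CPU does little besides servicing interrupts."""
    clocks_per_interrupt = IRQ_SEQUENCE + isr_body_clocks + RTI + avg_latency_clocks
    return clock_hz / clocks_per_interrupt

print(max_interrupts_per_second(20e6, INC_ZP))     # 18 clocks per interrupt
print(max_interrupts_per_second(20e6, INC_ZP, 2))  # plus ~2 clocks to finish the current instruction
```

With 18 clocks per interrupt a 20MHz part manages about 1.1 million per second; adding a couple of clocks of latency brings it to 1 million.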

If you really want code density, the best way I know of to get it is to use Forth. That assumes however that the application is large enough to justify having the kernel. Obviously it's not compact to have thousands of bytes of kernel for something that could be done in 70 bytes of assembly if the application is really that small.

Someone pointed to an article last year telling about the ARM being inspired by the 6502, but I didn't bookmark it and I can't remember the right search terms to find it again. Hopefully someone else will tell us. We all wish Samuel Falvo would come back. I'm sure he knows plenty about the comparison.

The fastest 6502s are running at over 200MHz in custom ICs. That would require extremely fast memory and I/O, onboard the same IC. Off-the-shelf ones are rated at 16MHz but will usually run quite a bit faster if the rest of your hardware is up to it. I would definitely recommend the 65816 though. When I was writing my '816 Forth kernel, I found that it was actually easier to program when constantly working with 16-bit cells, and my '816 Forth ran 2-3 times as fast as my '02 Forth at a given clock speed. The difference in price is negligible. The '816 is much better suited for multitasking, relocatable code, etc.

Electric_Eye here is working on a 16/32-bit version of the 6502 in an FPGA. My proposal for the 65Org32, an all-32-bit 6502, is described in this lengthy (9-page) topic. Note that there are no 8- or 16-bit entities (they're not needed), and that with all registers being 32-bit, it's like everything is in zero page, although the equivalent of the 65816's DP (direct-page) register is also 32-bit and allows offsets to be anything at all, with no page or bank boundaries. The same goes for the equivalent of the 65816's data and program banks-- there are the registers but they're 32-bit also so there are no bank boundaries.

Posted: Fri Aug 12, 2011 8:43 pm
by ElEctric_EyE
Arlet has done some comparing and mentioned here that he could get his 6502 8-bit core running up to 111MHz.

I'm just a bit busy right now... How many MIPS would that be?

Posted: Fri Aug 12, 2011 9:38 pm
by GARTHWILSON
Quote:
I'm just a bit busy right now... How many MIPS would that be?
Around 28, depending. For example, a lot of ZP addressing without indirects will bring it up a bit, and a lot of indirects and indexing will bring it down a bit.
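The figure of around 28 follows from the four-clocks-per-instruction average given earlier in the thread: 111/4 = 27.75. A small sketch of that arithmetic; the 3.5 and 4.5 clock averages are illustrative assumptions, not measurements, chosen only to show how the addressing-mode mix moves the result.

```python
def native_mips(clock_mhz: float, avg_clocks_per_instruction: float = 4.0) -> float:
    """Native (not Dhrystone) MIPS for a 6502 at a given clock."""
    return clock_mhz / avg_clocks_per_instruction

print(native_mips(4), native_mips(16))               # the 1 MIPS @ 4MHz, 4 MIPS @ 16MHz rule of thumb
print(native_mips(111))                              # Arlet's core at 111MHz
print(native_mips(111, 3.5), native_mips(111, 4.5))  # ZP-heavy vs. indirect-heavy code (assumed averages)
```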

Posted: Fri Aug 12, 2011 9:51 pm
by Tor
GARTHWILSON wrote:
Someone pointed to an article last year telling about the ARM being inspired by the 6502, but I didn't bookmark it and I can't remember the right search terms to find it again.

I remembered having read that as well, and I remembered what I was searching for at the time - I was reading obituaries about Personal Computer World (I was a subscriber from the beginning back when I was 17. I learned English by reading that magazine. But I digress.) So I just went through the same chain of searches and ended up here: http://www.theregister.co.uk/2009/06/11/pcw/page2.html
Quote:
Inmos? Ask anybody in the street today: “Never heard of it.”

Let’s not list all the things that went wrong. There is one star survival, heritage, legacy, what you will, from the UK industry in the 80s. Sophie Wilson, the best 6502 programmer ever, became disappointed with what she could do with the BBC Micro, and went off on her own to design a RISC processor that would do all the good things she liked about the 6502, and all the other things which she wished the 6502 could do.

The chip was the Acorn Risc Machine – the ARM, which started out merely as “the chip inside the Acorn Archimedes”.
Edit: A less poetic reference may be: http://www.engineersgarage.com/articles ... processors
which merely refers to ".. with latencies as low as that of the 6502".
There's another one here, an interview with Steve Furber (the other designer of the ARM processor), where he refers to the 6502: http://queue.acm.org/detail.cfm?id=1716385
Quote:
What we found was that the 16-bit micros in the early 1980s had worse interrupt response time than the 6502.
-Tor

Posted: Fri Aug 12, 2011 9:53 pm
by BigEd
GARTHWILSON wrote:
Quote:
I'm just a bit busy right now... How many MIPS would that be?
Around 28, depending. For example, a lot of ZP addressing without indirects will bring it up a bit, and a lot of indirects and indexing will bring it down a bit.
I reckon that would equate to about 2 DMIPS according to Sam's figures (Dhrystone being a synthetic benchmark aiming to measure some approximation of VAX-equivalent MIPS, as opposed to the machine native operations per second, which I think is your 28)

Cheers
Ed

Posted: Fri Aug 12, 2011 10:25 pm
by BigEd
On the business of the 6502 inspiring the ARM, some of the story is elaborated on Wikipedia, but Sophie maintains that "inspired by" isn't the right choice of words. Other people, myself and Sam included, do use that phrase.

Cheers
Ed

Posted: Sat Aug 13, 2011 10:00 am
by sixtyfive02
Many thanks for your advice!

1. I'll carefully read Samuel's table to understand how the 0.02 DMIPS/MHz figure comes out. It looks really low compared to the ARM7 (about 0.7 DMIPS/MHz, I think).

2. About ARM clones, I found two projects on www.opencores.org: the Amber RISC core and nnARM. They appear to work. I'm not sure yet whether they implement Thumb mode (I'm very interested in very compact code).

3. No idea whether I'm the first to visit 6502EX. The site is certainly new; it has been updated recently. I guess the code will only be available for payment, but it's not clear. If so, I'm not interested either.

I also found the 65GZ032. Any comments on it? DMIPS? Code density? Is the code freely available?

Thanks again for your feedback.

Posted: Sat Aug 13, 2011 10:03 am
by sixtyfive02
Hi,

Thanks a lot for your information. There's a lot of it, and it's valuable.

I understood that the 6502 will be better than the ARM in terms of interrupt latency. This is a good point!

You seem to be suggesting I go for a 65816. Is there any free RTL code for it? I'm interested in working on an FPGA.

I'll certainly have a look at the 9 pages describing electric_eye's implementation. Really interesting!

Thanks again for your help!

Posted: Sat Aug 13, 2011 11:00 am
by Arlet
sixtyfive02 wrote:
I understood that the 6502 will be better than the ARM in terms of interrupt latency. This is a good point!
This was only true in some cases on an ARM7, and is certainly no longer true on a Cortex. The Cortex core has an interrupt latency of only 12 cycles, which includes saving your registers, and dispatching to the correct handler.

Posted: Sat Aug 13, 2011 11:16 am
by BigEd
sixtyfive02 wrote:
I'll carefully read Samuel's table to understand how the 0.02 DMIPS/MHz figure comes out. It looks really low compared to the ARM7 (about 0.7 DMIPS/MHz, I think).
Here are some ideas:
  • lots of registers - avoids spilling to/from zero page a lot
  • memory bus width - 4x difference
  • cache - allows for a Harvard-like architecture internally, overlapping data and instruction activity
  • predication - avoids a branch penalty in many cases
  • more powerful instructions and addressing modes - get more work done for each instruction
Thanks for the pointer to Amber RISC Core - I've a feeling nnARM is defunct, due to legal pressure perhaps.

Cheers
Ed

Posted: Sat Aug 13, 2011 1:06 pm
by BigEd
Tor wrote:
GARTHWILSON wrote:
Someone pointed to an article last year telling about the ARM being inspired by the 6502, but I didn't bookmark it and I can't remember the right search terms to find it again.

Edit: A less poetic reference may be: http://www.engineersgarage.com/articles ... processors
which merely refers to ".. with latencies as low as that of the 6502".
There's another one here, an interview with Steve Furber (the other designer of the ARM processor), where he refers to the 6502: http://queue.acm.org/detail.cfm?id=1716385
Quote:
What we found was that the 16-bit micros in the early 1980s had worse interrupt response time than the 6502.
-Tor
Hi Tor
thanks for your research: I've taken these comments and pasted them into a new thread - hope that's OK.
Cheers
Ed

Posted: Sat Aug 13, 2011 7:49 pm
by GARTHWILSON
You haven't said what you want to do with it. The 6502 was not intended for multitasking [Edit, 5/15/14: I posted an article on simple methods of doing multitasking without a multitasking OS, at http://wilsonminesco.com/multitask/index.html] and large applications compiled from high-level languages (which is what Dhrystone is all about); but OTOH, it was simple enough that writing hand-optimized assembly was far more practical on the 6502 than it is on most modern processors. IOW, Dhrystone MIPS may or may not be relevant, depending on what you want to do with it.

Forth on the 6502 normally has 16-bit cells, and LOOP (or actually its internal runtime word, (loop)) has a fair amount of work to do to increment a 16-bit loop counter on the stack and compare it to a 16-bit limit which is also on the stack. But I wrote one for 32-bit, and it just got ridiculously long. If you had to do everything in 32-bit, it would be a sloth; but if you never need more than 8 bits for the loop counter and limit, DEX (possibly followed by CPX) and then BNE does the job, taking only 5 or 7 clocks per loop. Many of the 6502's contemporaries couldn't even execute a single instruction in so few clocks. If we made an all-32-bit 6502 (65Org32), a 32-bit loop could also be done in 5 or 7 clocks.
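The 5-or-7-clock figure covers only the loop-control instructions for one pass. A sketch of the arithmetic using the standard 6502 counts (DEX 2, CPX #imm 2, BNE 3 when taken and 2 when not), assuming the branch target stays on the same page, since crossing a page would add one more clock per taken branch:

```python
# 6502 clock counts; assumes the branch target is on the same page (else +1 per taken branch).
DEX = 2
CPX_IMM = 2
BNE_TAKEN = 3
BNE_NOT_TAKEN = 2

def loop_overhead_clocks(iterations: int, with_cpx: bool = False) -> int:
    """Total clocks spent on DEX [CPX #n] BNE loop control over `iterations` passes."""
    per_pass = DEX + (CPX_IMM if with_cpx else 0)
    # Every pass but the last takes the branch; the last one falls through.
    return iterations * per_pass + (iterations - 1) * BNE_TAKEN + BNE_NOT_TAKEN

print(loop_overhead_clocks(10), loop_overhead_clocks(10, with_cpx=True))
```

Each extra pass adds 5 clocks (7 with the CPX); the final pass is one clock cheaper because the branch falls through.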

But look how much more efficiently the 65816 does the 16-bit fetch (@ in Forth, which takes a 16-bit address at the top of the data stack and replaces it with the 16-bit contents of that address) compared to the 6502.

First for the 6502:

Code:

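       LDA  (0,X)      ; low byte of the cell at the address on top of the data stack
       PHA             ; save it
       INC  0,X        ; bump the 16-bit address in place...
       BNE  fet1
       INC  1,X        ; ...carrying into its high byte when needed
fet1:  LDA  (0,X)      ; high byte of the cell
       JMP  PUT

; and elsewhere, PUT, which is used in so many places, is:
PUT:   STA  1,X        ; high byte of the result
       PLA             ; recover the low byte
       STA  0,X        ; and store it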
For the '816, the whole thing is only:

Code:

       LDA  (0,X)
       STA  0,X         ; For the '816, PUT is only one 2-byte instruction anyway, so there's no sense in jumping to it.
So it takes 10 instructions on the 6502 to do what the 65816 can do in two.
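To make the 6502 sequence easier to follow, here is a Python model that mirrors it step by step. It assumes, as the code does, that X indexes a data stack in zero page holding 16-bit little-endian cells; memory is modeled as a flat 64K bytearray, and the function name is hypothetical.

```python
def forth_fetch_6502(mem: bytearray, x: int) -> None:
    """Model of the 6502 '@' above: replace the address in the top cell
    (mem[x], mem[x+1]) with the 16-bit little-endian value stored there."""
    hi_ix = (x + 1) & 0xFF                    # zero-page indexing wraps within page 0

    def pointer() -> int:
        return mem[x] | (mem[hi_ix] << 8)

    low = mem[pointer()]                      # LDA (0,X) / PHA
    mem[x] = (mem[x] + 1) & 0xFF              # INC 0,X
    if mem[x] == 0:                           # BNE fet1 falls through on wraparound...
        mem[hi_ix] = (mem[hi_ix] + 1) & 0xFF  # ...so INC 1,X carries into the high byte
    high = mem[pointer()]                     # fet1: LDA (0,X)
    mem[hi_ix] = high                         # PUT: STA 1,X
    mem[x] = low                              # PLA / STA 0,X

mem = bytearray(0x10000)
mem[0x80], mem[0x81] = 0xFF, 0x12             # top cell holds the address $12FF
mem[0x12FF], mem[0x1300] = 0xAA, 0xBB         # 16-bit value $BBAA straddles a page boundary
forth_fetch_6502(mem, 0x80)
```

On the '816 with a 16-bit accumulator, the carry handling and the PHA/PLA shuffle disappear because LDA (0,X) reads both bytes at once.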
Quote:
Quote:
I understood that the 6502 will be better than the ARM in terms of interrupt latency. This is a good point!
This was only true in some cases on an ARM7, and is certainly no longer true on a Cortex. The Cortex core has an interrupt latency of only 12 cycles, which includes saving your registers, and dispatching to the correct handler.
Is that 12 actual clocks (pulses)? Most interrupt handling should not require saving all the registers, and with a hardware-complexity penalty (or actually not if you do it in an FPGA), you can have the interrupt put the right address in the vector location so the correct handler is jumped to directly. IOW, the 6502 may still be able to do it in 7 or 11 clocks, never more than 19 (that is, pushing A, X, and Y; P is already pushed). OTOH, the 6502's longer instructions will add an average of a couple more clocks because the currently-executing instruction has to finish before the interrupt sequence can start. (I call them clocks, not cycles, because on some other processors they use the term "cycle" when they mean a set of four or more clock pulses, which is deceiving, like PIC's four clocks per cycle.)

As for registers, basically all 256 bytes of ZP is processor registers on the 6502, and the '816 lets each task have its own ZP.

I'm not arguing of course that the 6502 is a performance match for the ARM, but rather that for small applications, much of the perceived advantage of a higher-end processor might evaporate.

16 Bit Forth

Posted: Sat Aug 13, 2011 11:39 pm
by BigDumbDinosaur
GARTHWILSON wrote:
For the '816, the whole thing is only:

Code:

       LDA  (0,X)
       STA  0,X         ; For the '816, PUT is only one 2-byte instruction anyway, so there's no sense in jumping to it.
Don't forget the above sequence would also need REP #%00100000 to enable 16-bit loads and stores and SEP #%00100000 to revert to 8-bit mode. So it would take a little longer to execute (6 cycles longer, to be exact, since REP and SEP take 3 cycles each) and be 4 bytes larger. However, it's still much faster than anything that could be coded on the 65(C)02.

The 65C816 has a number of useful stack instructions that are almost tailor-made for languages such as Forth. For example, you can push an address to the stack and then read the contents of that address without touching any zero page memory. First, the 65(C)02 way:

Code:

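          LDX #<addr            ;low byte of the address
          LDY #>addr            ;high byte of the address
          STX zpptr             ;build a pointer in zero page
          STY zpptr+1
          LDY #0                ;index
          LDA (zpptr),y         ;grab the byte at addr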
Now, the 65C816 way:

Code:

          REP #%00100000        ;select 16 bit .A
          LDA #addr             ;load full address in one operation
          PHA                   ;put it on the stack
          SEP #%00100000        ;select 8 bit .A
          LDY #0                ;index
          LDA (1,s),y           ;grab byte like LDA (zpaddr),y
                                ;(the 2-byte address is still on the stack; drop it when done)
One of the handy things about this sort of stack acrobatics is the relative ease with which fully relocatable code can be developed. However, the 65C816 offers another approach, and that is the ability to relocate zero page. That feature makes it practical to give each subroutine its own zero page (and stack, which can also be relocated). When the subroutine has finished, put ZP and the stack back where they were and go on your way.
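A Python model of the 65C816 sequence may help show why no zero page is needed: the pointer lives on the hardware stack, and (1,s),y reads through it. This is a sketch under stated assumptions: bank 0 only, a hypothetical starting S of $01FF, addresses that wrap at 64K (a real '816 adds the data bank register and can carry into the next bank), and helper names made up for illustration.

```python
def pha16(mem: bytearray, s: int, value: int) -> int:
    """16-bit PHA: push the high byte, then the low byte; return the new S."""
    mem[s] = (value >> 8) & 0xFF
    s = (s - 1) & 0xFFFF
    mem[s] = value & 0xFF
    return (s - 1) & 0xFFFF

def lda_sr_indirect_y(mem: bytearray, s: int, offset: int, y: int) -> int:
    """8-bit LDA (offset,S),Y: read a pointer at S+offset, add Y, load the byte there."""
    pointer = mem[(s + offset) & 0xFFFF] | (mem[(s + offset + 1) & 0xFFFF] << 8)
    return mem[(pointer + y) & 0xFFFF]  # bank 0 only in this sketch

mem = bytearray(0x10000)
mem[0x1234] = 0x99                       # hypothetical byte at addr = $1234
s = pha16(mem, 0x01FF, 0x1234)           # REP / LDA #addr / PHA
value = lda_sr_indirect_y(mem, s, 1, 0)  # SEP / LDY #0 / LDA (1,s),y
```

Because the pushed low byte ends up at S+1 and the high byte at S+2, offset 1 always finds the most recently pushed address, wherever the stack happens to be.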