6502 or ARM core selection?
- GARTHWILSON
- Forum Moderator
- Posts: 8774
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Quote:
Don't forget the above sequence would also include REP #%00100000 to enable 16 bit loads and stores and SEP #%00100000 to revert to 8 bit mode.
Last edited by GARTHWILSON on Sun Aug 14, 2011 1:39 am, edited 1 time in total.
- GARTHWILSON
- Forum Moderator
- Posts: 8774
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: 16 Bit Forth
BigDumbDinosaur wrote:
The 65C816 has a number of useful stack instructions that are almost tailor-made for languages such as Forth. For example, you can push an address to the stack and then read the contents of that address without touching any zero page memory. First, the 65(c)02 way:
Code:
LDX #<addr
LDY #>addr
STX zpptr
STY zpptr+1
LDY #0
LDA (zpptr),y
Now, the 65C816 way:
Code:
REP #%00100000 ;select 16 bit .A
LDA #addr ;load full address in one operation
PHA ;put it on the stack
SEP #%00100000 ;select 8 bit .A
LDY #0 ;index
LDA (1,s),y ;grab byte like LDA (zpaddr),y
You can go further than that and:
- push a literal on the hardware stack in a single instruction with PEA which is "push 16-bit immediate data."
- If you want to fetch the 16-bit number contained at that address and push it on the hardware stack, use PEI.
- PER pushes onto the stack a 16-bit address relative to where the PC is at the moment, with the offset being specified in the operand. This is helpful in writing self-relocatable code.
So how I would do the above would be:
Code:
PEA addr ; (The # is not used for PEA, even though it's always immediate.)
LDA (1,s),y
The above is only if you want to leave that address on the return stack though. Otherwise we're back to just LDA addr, a single instruction for the whole operation.
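For readers who want to see what the stack-relative indirect indexed mode is doing, here is a minimal Python sketch of PEA followed by LDA (1,s),y. It is only a model under stated assumptions, not an emulator: a single 64K bank, a data bank register of zero, native mode with a 16-bit stack pointer, and an 8-bit accumulator. The function names, the address $2000 and the "HELLO" test string are hypothetical and are not from the posts above.

```python
def pea(mem, s, value):
    """Model PEA: push a 16-bit value, high byte first.
    S points at the next free byte and decrements after each push."""
    mem[s] = (value >> 8) & 0xFF
    s = (s - 1) & 0xFFFF
    mem[s] = value & 0xFF
    s = (s - 1) & 0xFFFF
    return s


def lda_stack_indirect_y(mem, s, offset, y):
    """Model LDA (offset,S),Y with an 8-bit accumulator.
    Assumes one 64K bank and a data bank register of zero."""
    base = (s + offset) & 0xFFFF                        # where the pointer sits on the stack
    pointer = mem[base] | (mem[(base + 1) & 0xFFFF] << 8)  # little-endian 16-bit pointer
    return mem[(pointer + y) & 0xFFFF]                  # byte at pointer + Y


mem = bytearray(0x10000)          # hypothetical flat 64K memory
addr = 0x2000                     # hypothetical address of some data
mem[addr:addr + 5] = b"HELLO"

s = pea(mem, 0x01FF, addr)        # PEA addr
first = lda_stack_indirect_y(mem, s, 1, 0)   # LDY #0 / LDA (1,s),y
last = lda_stack_indirect_y(mem, s, 1, 4)    # LDY #4 / LDA (1,s),y
print(hex(s), chr(first), chr(last))
```

The offset of 1 is needed because S always points one byte below the last item pushed, so the low byte of the pushed address is at S+1.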
Quote:
One of the handy things about these sorts of stack acrobatics is the relative ease with which fully relocatable code can be developed. However, the 65C816 offers another approach and that is the ability to relocate zero page. That feature makes it practical to give each subroutine its own zero page (and stack, which can also be relocated). When the subroutine has finished, put ZP and the stack back where they used to be and go on your way.
GARTHWILSON wrote:
Is that 12 actual clocks (pulses)? Most interrupt handling should not require saving all the registers, and with a hardware-complexity penalty (or actually not if you do it in an FPGA), you can have the interrupt put the right address in the vector location so the correct handler is jumped to directly.
The idea is that you can write a standard C function, and plug the address of the function into the vector table, and it will be called when that interrupt occurs, using a standard calling context. All the normal 'scratch' registers that a function can use are saved by the interrupt hardware. I think there are 5 of those. The interrupt handler also allows 32 programmable priorities and automatic nesting of interrupts. On a typical controller, you'll have a few dozen different interrupt sources.
Also, the Cortex has an optimization when two interrupts overlap. As soon as the first one is handled, it jumps to the second one, skipping the part where it restores and saves the registers.
-
sixtyfive02
- Posts: 7
- Joined: 12 Aug 2011
Quote:
Here are some ideas:
- lots of registers - avoids spilling to/from zero page a lot
- memory bus width - 4x difference
- cache - allows for harvard-like architecture internally, overlap of data and instruction activity
- predication - avoids a branch penalty in many cases
- more powerful instructions and addressing modes - get more work done for each instruction
I had a look at the Samuel tables. It is strange to see (first table) that the Z80 performs the same as the 6502 (in the literature the 6502 is claimed to be more efficient). Anyway, let's take 0.02 DMIPS/MHz.
Given that, I tried to work out a different parameter that also takes the gate count into account. In fact, gate count is fundamental for an FPGA user. I defined DMIPS/kgates instead of DMIPS/MHz. From the internet I found: ARM7TDMI = 48 kgates; 0.74 DMIPS/MHz in Thumb mode (I'm interested in code density). For the 6502 I took 3.5 kgates (also from the internet) and 0.02 DMIPS/MHz.
With these numbers we get: 0.015 DMIPS/kgates for the ARM7; 0.006 DMIPS/kgates for the 6502. I think these figures better represent architectural efficiency when gate count is important. Given these numbers, the ratio between the 6502 and the ARM7 is about 1:35 in DMIPS/MHz, and it drops to about 1:3 in DMIPS/kgates. Not bad, considering that the ARM7 can manipulate data in 32-bit mode!
So I think this is a fairer way to compare the cores in order to find a compromise between efficiency and gate count. Maybe with a 32-bit 6502 this gap could be reduced even further.
I hope it makes sense.
I'm still looking for code density statistics for the 6502 but I haven't found anything on the internet so far.
Cheers
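The arithmetic in this post can be checked with a few lines of Python. This is only a sketch that uses the figures quoted in the post (0.74 DMIPS/MHz and 48 kgates for the ARM7TDMI in Thumb mode, 0.02 DMIPS/MHz and 3.5 kgates for the 6502). The function and variable names are made up for illustration, and the metric is strictly DMIPS/MHz per kgate.

```python
def dmips_per_kgate(dmips_per_mhz, kgates):
    """Performance per MHz divided by gate count in kgates."""
    return dmips_per_mhz / kgates


# Figures as quoted in the post above, not independently verified.
ARM7_DMIPS_PER_MHZ, ARM7_KGATES = 0.74, 48.0
M6502_DMIPS_PER_MHZ, M6502_KGATES = 0.02, 3.5

arm7 = dmips_per_kgate(ARM7_DMIPS_PER_MHZ, ARM7_KGATES)
m6502 = dmips_per_kgate(M6502_DMIPS_PER_MHZ, M6502_KGATES)

raw_ratio = ARM7_DMIPS_PER_MHZ / M6502_DMIPS_PER_MHZ   # ARM7 advantage per MHz
per_gate_ratio = arm7 / m6502                          # ARM7 advantage per MHz per kgate

print(f"ARM7: {arm7:.4f}  6502: {m6502:.4f}")
print(f"per MHz ratio: {raw_ratio:.1f}  per kgate ratio: {per_gate_ratio:.2f}")
```

With these inputs the per-MHz advantage comes out at 37 to 1 and the per-kgate advantage at roughly 2.7 to 1, which matches the "about 1:35" and "about 1:3" quoted above.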
This topic is now the top hit for 6502 code density, so: search for a pdf titled "Code Density Concerns for New Architectures" and you should find a diagram like this:
(There are more breakdowns for different elements of the benchmark - all the sources are here and there are comments on the architectures here)
(Fair use exception to copyright, of course. Credit to the authors: Vincent M. Weaver and Sally A. McKee)
Cheers
Ed
Edit: replaced lost image with new screenshot
Last edited by BigEd on Mon Aug 25, 2014 6:33 pm, edited 3 times in total.
To answer some more points:
sixtyfive02 wrote:
I also found the 65GZ032. Any comments on this? DMIPS? Code density? Is the code available for free?
For this, and for pointers to some other variations on the 6502 theme, and for André's own 65k project, have a look at his pages here. His project allows for 32bit, and more besides.
On the question of 6502 gate count, there's a thread here with a summary of 10 implementations of 6502 for FPGA.
On this point:
GARTHWILSON wrote:
Electric_Eye here is working on a 16/32-bit version of the 6502 in an FPGA. My proposal for the 65Org32, an all-32-bit 6502, is described in this lengthy (9-page) topic.
I feel I should clarify: the 9-page topic was the latest of several brainstorms which covered all sorts of enhancement ideas for 6502. It resulted in a couple of specific ideas which seemed simple and attractive: the 16 and 32 bit versions which Garth mentions.
I coded up the 16-bit version, using Arlet's 6502 core as a base. I've implemented that core in an FPGA module, and TeamTempest and BitWise have both ported their 6502 assemblers to target the core. Electric_Eye has taken a fork of the core and started work on a spartan6 development board. I don't believe it would be much work to create the 32-bit variant (which would actually be simpler.)
sixtyfive02 wrote:
You seem to suggest me to go for a 65816. Is there any RTL code for free?
No, I don't think 65816 has been tackled: there's a fair chance it's protected.
Cheers
Ed
- BigDumbDinosaur
- Posts: 9428
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
- Contact:
Re: 16 Bit Forth
GARTHWILSON wrote:
You can go further than that and:
- push a literal on the hardware stack in a single instruction with PEA which is "push 16-bit immediate data."
- If you want to fetch the 16-bit number contained at that address and push it on the hardware stack, use PEI.
- PER pushes onto the stack a 16-bit address relative to where the PC is at the moment, with the offset being specified in the operand. This is helpful in writing self-relocatable code.
Consider the case where a subroutine is to print a text string selected by an index value in one of the registers. The first step (after validating the index value) would be to get the appropriate string pointers from a look-up table. With the 65(c)02, you'd store the pointers into a ZP location. With the '816, you could push them to the stack and access the string with LDA (1,S),Y. PEA could push the starting address of the look-up table to the stack, but you'd still be faced with loading the actual pointers into a register in order to prime the stack with the starting address of the string. Hence my code example.
x86? We ain't got no x86. We don't NEED no stinking x86!
BigEd wrote:
Here are some ideas:
- lots of registers - avoids spilling to/from zero page a lot
- memory bus width - 4x difference
- cache - allows for harvard-like architecture internally, overlap of data and instruction activity
- predication - avoids a branch penalty in many cases
- more powerful instructions and addressing modes - get more work done for each instruction
Not sure this is pertinent to this thread, and I've never tried to work out the details to know if it really makes sense.
Keep two copies of memory and make writes go to both but allow reads in parallel. (so you get something like Harvard and possibly get some of the advantage of registers for zp)
Dedicate a couple of bits in the instruction to jumps/subroutines so that every instruction is (potentially) a jump or call to a subroutine or a return. (so you can do something like microcoding without so much subroutine time penalty or macro space penalty) (wouldn't necessarily have to be every instruction)
bogax wrote:
Keep two copies of memory and make writes go to both but allow reads in parallel. (so you get something like Harvard and possibly get some of the advantage of registers for zp)
bogax wrote:
Dedicate a couple of bits in the instruction to jumps/subroutines so that every instruction is (potentially) a jump or call to a subroutine or a return. (so you can do something like microcoding without so much subroutine time penalty or macro space penalty) (wouldn't necessarily have to be every instruction)
Cheers
Ed
-
sixtyfive02
- Posts: 7
- Joined: 12 Aug 2011
Quote:
You haven't said what you want to do with it. The 6502 was not intended for multitasking and large applications compiled from high-level languages (which is what Dhrystone is all about); but OTOH, it was simple enough that writing hand-optimized assembly was far more practical on the 6502 than it is on most modern processors. IOW, Dhrystone MIPS may or may not be relevant, depending on what you want to do with it.
I would prefer investing in the 6502 (that is in fact what I would like to do, since I love this processor for its simplicity) but I don't want to discover after a few months that I need more DMIPS to do something that I don't foresee today. With an ARM7 clone I could even reuse AHB peripherals that I can find on the internet, speeding up the assembly and verification of the FPGA.
BTW, I think I understand your advice: the 6502 is better for simple applications where hand-written code is required. The 65816 would be better, but no clones are available today in RTL form.
So the major doubt remains the DMIPS that I will require for the specific application, but I could probably mitigate this potential problem by putting two 6502 cores into the FPGA. The gate count is so small that I could fit at least a couple of them.
Thank you!
-
sixtyfive02
- Posts: 7
- Joined: 12 Aug 2011
Quote:
For this, and for pointers to some other variations on the 6502 theme, and for André's own 65k project, have a look at his pages here. His project allows for 32bit, and more besides.
Quote:
On the question of 6502 gate count, there's a thread here with a summary of 10 implementations of 6502 for FPGA.
Quote:
I coded up the 16-bit version, using Arlet's 6502 core as a base. I've implemented that core in an FPGA module, and TeamTempest and BitWise have both ported their 6502 assemblers to target the core.
Congratulations on this working 16-bit derivative!
Cheers
sixtyfive02 wrote:
Quote:
On the question of 6502 gate count, there's a thread here with a summary of 10 implementations of 6502 for FPGA.
(The ones I don't mention in the above paragraph might also be splendid - I don't mean anything by not naming them.)
There's a test suite by Wolfgang Lorenz - it's a good sign to pass that - but correctly responding to RDY and interrupts in all cycles of all instructions is another level of detail. Having precise timing and the right bus activity in each cycle is necessary for games. I collected some test suite info in a wiki page.
Quote:
Congratulations on this working 16-bit derivative!
Cheers
Ed
I just posted elsewhere a link to an article with a few well written paragraphs on ARM's feature set, performance and code density.
Re: 16 Bit Forth
GARTHWILSON wrote:
BigDumbDinosaur wrote:
The 65C816 has a number of useful stack instructions that are almost tailor-made for languages such as Forth. For example, you can push an address to the stack and then read the contents of that address without touching any zero page memory. First, the 65(c)02 way:
Code:
LDX #<addr
LDY #>addr
STX zpptr
STY zpptr+1
LDY #0
LDA (zpptr),y
Now, the 65C816 way:
Code:
REP #%00100000 ;select 16 bit .A
LDA #addr ;load full address in one operation
PHA ;put it on the stack
SEP #%00100000 ;select 8 bit .A
LDY #0 ;index
LDA (1,s),y ;grab byte like LDA (zpaddr),y
The 65k way with its two new registers E and B could be something like:
Code:
LEA addr,Y ; load register E with address "addr" and Y added
LDA (E)
Code:
LDA.W #addr ; word load addr into AC, default is to zero-extend
TAB ; full width transfer to B
LDA (B,0),Y ; use B as base address
André
BigEd wrote:
Thanks for finding the 6502EX - I was unaware of that(*). It's a 32-bit extension, with an 8-bit mode, and lots of registers and operations on 4 byte lanes and some remnant of 64k banks in 32bit mode. (But it seems to be closed source and presumably intended to be licensed for money, so not of direct interest to me.)
sixtyfive02 wrote:
3. No idea if I'm the first to visit 6502EX. The site is certainly new; it has been updated recently. I guess the code will only be available for payment, but that's not clear. In that case I'm not interested either.
Quote:
The 6502EX is a 32bit extension of the popular 6502 8bit processor. 6502EX source code is available under GNU GPL license
Ed