PostPosted: Mon Jan 24, 2011 1:17 am 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
For the 65Org32 project, I have been exploring a stack-based technique for mechanically translating Forth code into efficient instruction sequences, ones capable of exploiting more complex addressing modes when circumstances allow. In the process, I realized that accumulator architectures can be every bit as powerful as RISC designs, and about 50% more dense.

I considered the following infix instruction:
Code:
a[i] := b[j+3];

which when expanded as an equivalent Forth sequence becomes:
Code:
a i @ cells + b j @ 3 + cells + @ swap !


Through a mechanical translation process (sorry, no code yet; still deriving the rules on paper/whiteboard), the AVR instruction stream came to:
Code:
; I count 25 instructions, consuming 54 bytes.
; If my memory of timings is correct (not likely, but...)
; I anticipate code execution taking 35 cycles.
clr r0              ; constant zero used several times
ldi zl,low(i)       ; a i @
ldi zh,high(i)
ld  r2,z+
ld  r3,z
add r2,r2           ; cells
adc r3,r0
adiw r2,a           ; +
ldi zl,low(j)       ; b j @
ldi zh,high(j)
ld  r4,z+
ld  r5,z
adiw r4,3           ; 3 +
add r4,r4           ; cells
adc r5,r0
ldi zl,low(b)       ; + @
ldi zh,high(b)
add zl,r4
adc zh,r5
ld  r6,z+
ld  r7,z
mov zl,r2           ; swap !
mov zh,r3
st  z+,r6
st  z,r7

while the 65816-equivalent came to:
Code:
; Code size: 24 bytes
; Cycles: 37
; Just a weeee bit slower than the AVR,
; assuming accesses to i, j, a, and b
; use long absolute addresses, the
; obvious worst-case configuration.
LDA i               ; a i @
ASL                 ; cells +
TAX                 ; b j @
LDA j
CLC                 ; 3 +
ADC #3
ASL                 ; cells
TAY                 ; + @
LDA b,Y
STA a,X             ; swap !

Both instruction streams run in a similar amount of time. Note, however, that the 65816 code takes somewhat less than half the memory space to encode, since each AVR instruction occupies at least 16 bits of program space.

I've learned that good quality, high performance code for CISC architectures is available through mechanical translation; the trick is to defer code generation as long as you can, on the premise that things like sums or simple shifts can be absorbed by the address-generation logic of a subsequent instruction. You don't "flush" the code-generation cache until it can be proven that no more logic can be folded into a more complex addressing mode without losing data. In effect, you reconstruct the original program's parse tree just enough to generate the most efficient instruction at that time.

Also, one must learn to let go of the notion that registers hold values, and instead treat them as caches for memory-resident values.

Perhaps, in time, I can write a more detailed blog article or two on this topic. In the meantime, I'm still trying to figure out some corner cases. It was really quite interesting to see how one could use a compile-time stack to keep track of accumulated program-generation state and the binaries it makes possible. The real eye-opener, though, was how the CISC 65816 was able to keep up with the RISC AVR platform while consuming half the code space.
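
To give a feel for what the mechanical translation does, here is a rough sketch in Python. It is purely illustrative: the names (translate, operand, the little tuple node shapes) are placeholders invented for this post, not code from my tool, which doesn't exist yet. Nothing is emitted while sums, constants, and cells shifts can still be folded into an addressing mode; only ! forces code out, and for the statement above it regenerates exactly the ten-instruction 65816 sequence shown earlier:
Code:
def translate(words):
    stack, regs = [], iter("XY")       # compile-time stack; free index registers

    def operand(expr):
        """Return (prelude, operand_text) for an address expression, folding
        base + 2*index shapes into a single absolute-indexed operand."""
        if expr[0] == "var":                           # plain variable address
            return [], expr[1]
        if expr[0] == "add" and expr[1][0] == "var" and expr[2][0] == "shl":
            base, shl = expr[1][1], expr[2]
            idx, disp = shl[1], 0
            if idx[0] == "add" and idx[2][0] == "lit": # fold "n +" into ADC #n
                idx, disp = idx[1], idx[2][1]
            pre, op = operand(idx[1])                  # idx is ('fetch', addr)
            code = pre + [f"LDA {op}"]
            if disp:
                code += ["CLC", f"ADC #{disp}"]
            code += ["ASL"] * shl[2]                   # 'cells' becomes a shift
            reg = next(regs)
            code.append(f"TA{reg}")
            return code, f"{base},{reg}"
        raise NotImplementedError("address shape not handled in this sketch")

    out = []
    for w in words:
        if w == "@":
            stack.append(("fetch", stack.pop()))
        elif w == "cells":
            stack.append(("shl", stack.pop(), 1))      # one shift on a 16-bit cell
        elif w == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(("add", a, b))
        elif w == "swap":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif w == "!":                                 # the only word that emits
            addr, val = stack.pop(), stack.pop()
            a_pre, a_op = operand(addr)                # index for the store
            v_pre, v_op = operand(val[1])              # val is ('fetch', addr)
            out += a_pre + v_pre + [f"LDA {v_op}", f"STA {a_op}"]
        elif w.isdigit():
            stack.append(("lit", int(w)))              # numeric literal
        else:
            stack.append(("var", w))                   # variable: push its address
    return out

print("\n".join(translate("a i @ cells + b j @ 3 + cells + @ swap !".split())))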


PostPosted: Mon Jan 24, 2011 1:33 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8517
Location: Southern California
very nice!


PostPosted: Mon Jan 24, 2011 7:02 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
If I use avr-gcc -O2 to translate this:
Code:
    a[i] = b[j+3];

I get this:
Code:
   lds r30,i
   lds r31,(i)+1
   lsl r30
   rol r31
   subi r30,lo8(-(a))
   sbci r31,hi8(-(a))
   lds r26,j
   lds r27,(j)+1
   lsl r26
   rol r27
   subi r26,lo8(-(b))
   sbci r27,hi8(-(b))
   mov r29,r27
   mov r28,r26
   ldd r24,Y+6
   ldd r25,Y+7
   std Z+1,r25
   st Z,r24


18 instructions, 44 bytes.


PostPosted: Mon Jan 24, 2011 8:20 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Comparing a C compiler against a human who hasn't internalized the hundreds of AVR instructions seems a bit disingenuous. :)

Anyway, my point wasn't to incite an AVR-vs-65xx war (though, in the interests of disclosure, I do note that the 65816 code is still smaller by a long shot!). It was, rather, to demonstrate that rule-based code transformations can result in good-quality code for CISC architectures.


PostPosted: Mon Jan 24, 2011 8:26 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
kc5tja wrote:
Comparing a C compiler against a human who hasn't internalized the hundreds of AVR instructions seems a bit disingenuous. :)


I know.. :) I was just providing a reference. Anyway, it could be an option to use C as a target language, and run the output through a C compiler.


PostPosted: Thu Feb 03, 2011 3:35 pm

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
I like the goal here, the idea that address generation logic does some of your arithmetic in order to reduce the number of instructions (and clock cycles). And the observation about code size is certainly thought-provoking. On a CPU that offers a small number of potential operations, it's natural that each instruction will be small. Sometimes less is more!

But how many instructions are required to get the job done? I must say the 65xx is shown to best advantage by this coding example, which fits -- just barely! -- into the 65xx register set. OTOH, if this example is typical of what most programs have to deal with, then an optimal balance has been struck.

Straying from the topic slightly, the 65816 sequence can be optimized further by eliminating the CLC instruction. Presumably the preceding ASL has already cleared Carry. (If the ASL set Carry you're SOL no matter what.) I wonder if a compiler would've exploited this!

cheers,

Jeff

Code:
; Code size: 24 bytes
; Cycles: 37
; Just a weeee bit slower than the AVR,
; assuming accesses to i, j, a, and b
; use long absolute addresses, the
; obvious worst-case configuration.
LDA i               ; a i @
ASL                 ; cells +
TAX                 ; b j @
LDA j
CLC                 ; 3 + <---------- Carry already clear at this point
ADC #3
ASL                 ; cells
TAY                 ; + @
LDA b,Y
STA a,X             ; swap !


PostPosted: Sat Feb 05, 2011 8:24 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Arlet wrote:
I know.. :) I was just providing a reference. Anyway, it could be an option to use C as a target language, and run the output through a C compiler.


Not for my purposes. Remember, this is a project intended to study instruction set design choices in the hopes of engineering an acceptable instruction set for the 65org32 architecture. The primary question addressed here is, what comprises the optimal balance between compiler complexity and instruction set complexity?

Both the accumulator and pure-RISC approaches hold value, but they are ideal for very different purposes.


PostPosted: Sat Feb 05, 2011 8:30 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Dr Jefyll wrote:
Straying from the topic slightly, the 65816 sequence can be optimized further by eliminating the CLC instruction. Presumably the preceding ASL has already cleared Carry. (If the ASL set Carry you're SOL no matter what.) I wonder if a compiler would've exploited this!


If the compiler can prove at compile time that the high bit of the fetched word would be clear, then yes. A static type engine sophisticated enough to express this is quite doable; consult some of the more modern functional programming languages for examples of how type systems influence program correctness, particularly Haskell, a language I've had the pleasure of using in the past with very good results. Other than Oberon, I've rarely found a language where, if a program compiles cleanly, it has a better than 90% chance of working as intended the first time.[1]

The difficulty, however, comes in the form of complexity versus benefit. My toy compiler lacks typing of any kind right now, because it's an implementation detail that only serves to obscure my primary goal. Were I to implement such a type engine, though, I would need to come up with a domain-specific language to express, "This variable may contain the set of all positive natural numbers." Semantics-checking logic would be needed as well, and binary operators like +, *, and so on would each require implementation and debugging effort.
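
To make that concrete, here is a tiny sketch (Python, purely illustrative; Range and add_constant are names invented for this post, and my toy compiler has nothing like this yet) of the sort of value-range "type" that would justify dropping the CLC: if the value fed to the preceding ASL provably has bit 15 clear, the shift cannot set carry, so the CLC is dead code.
Code:
WORD_BITS = 16

class Range:
    """Closed interval of possible values of an unsigned 16-bit cell."""
    def __init__(self, lo, hi):
        assert 0 <= lo <= hi < (1 << WORD_BITS)
        self.lo, self.hi = lo, hi

    def asl_leaves_carry_clear(self):
        # ASL copies bit 15 into carry, so carry stays clear iff hi < $8000
        return self.hi < (1 << (WORD_BITS - 1))

def add_constant(last_asl_input, n):
    """Emit ADC #n, preceded by CLC only when the most recent ASL
    (applied to a value with range last_asl_input) might have set carry."""
    code = []
    if not last_asl_input.asl_leaves_carry_clear():
        code.append("CLC")
    code.append(f"ADC #{n}")
    return code

# i is known to be a small array index: the CLC is provably redundant
print(add_constant(Range(0, 1023), 3))     # -> ['ADC #3']
# nothing is known about i: keep the CLC
print(add_constant(Range(0, 0xFFFF), 3))   # -> ['CLC', 'ADC #3']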

_________________________
1. Before anyone objects, note that I'm stating a thesis here: that the technology required for high-quality, high-expressivity typing engines exists, and that it can be used to implement the kind of optimization logic detailed above. I cited both Oberon and Haskell, but Java's type engine would work as well, albeit quite verbosely. Note that neither Oberon, Haskell, nor Java actually implements types like I'm describing out of the box; please don't confuse my thesis with an assertion that Haskell et al. can express types like these and exploit them accordingly. However, PL/I does implement a type system sophisticated enough to express this. Whether PL/I compilers take advantage of it in practice is unknown; I don't recall a PL/I compiler for the 65xx architecture.


PostPosted: Sat Feb 05, 2011 9:10 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
kc5tja wrote:
Not for my purposes. Remember, this is a project intended to study instruction set design choices in the hopes of engineering an acceptable instruction set for the 65org32 architecture. The primary question addressed here is, what comprises the optimal balance between compiler complexity and instruction set complexity?

Both the accumulator and pure-RISC approaches hold value, but they are ideal for very different purposes.


Aha, I see.. I didn't realize that. That's already a tough question, even if you don't add 'compiler complexity' to the mix of parameters.

I once designed a simple CPU that used 4 stacks. It basically worked as an accumulator + X/Y register design, but every register was in fact implemented as a small stack. So, you could write something to the X register, use it, and then get the old X register value back. The advantage is that you get a compact instruction set (only 2 bits are needed for register encoding) while still having plenty of room for temporaries. Unlike single-stack designs, there's also no need for stack-manipulation instructions, which simplifies the hardware design a lot.
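
A quick behavioural model of the idea (Python, just to illustrate; StackedReg is a made-up name, and this is not the actual hardware design): writing a register pushes onto a short shadow stack, so the previous contents can be recovered later without any save/restore or stack-manipulation instructions.
Code:
class StackedReg:
    """One architectural register backed by a small hardware stack."""
    def __init__(self, depth=4):
        self.cells = [0] * depth      # top of the stack is cells[0]

    def write(self, value):           # e.g. a load into X: pushes, old value kept
        self.cells = [value] + self.cells[:-1]

    def read(self):                   # e.g. using X in an addressing mode
        return self.cells[0]

    def drop(self):                   # recover the register's previous value
        self.cells = self.cells[1:] + [0]

X = StackedReg()
X.write(10)            # X holds an outer-loop index
X.write(3)             # borrow X for a moment
assert X.read() == 3
X.drop()               # the old X is back; no save/restore code was needed
assert X.read() == 10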


PostPosted: Thu Feb 10, 2011 2:35 pm

Joined: Wed Oct 27, 2010 1:28 pm
Posts: 25
Location: Valladolid (Spain)
kc5tja wrote:
I considered the following infix instruction:
Code:
a[i] := b[j+3];

which when expanded as an equivalent Forth sequence becomes:
Code:
a i @ cells + b j @ 3 + cells + @ swap !

Code:
... lots of code ...

; Code size: 24 bytes
; Cycles: 37


I don't know much about Forth, but it looks like a quite unreadable language. I assume you are just moving data from one array into another, and this can be done much faster on an AVR than on any 6502:
Code:
   ldd R0,Y+3
   st  Z,R0

4 bytes, 4 cycles (for 8-bit transfer)
In the AVR you have 3 16-bit pointers and lots of general purpose registers. Just put them into use instead of storing variables in RAM. BTW, one great thing about C, one that also gets lots of criticism, is pointers.


PostPosted: Thu Feb 10, 2011 9:52 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8517
Location: Southern California
Quote:
I don't know much about Forth, but it looks like a quite unreadable language. I assume you are just moving data from one array into another, and this can be done much faster on an AVR than on any 6502:
Code:
   ldd R0,Y+3
   st  Z,R0

4 bytes, 4 cycles (for 8-bit transfer)
In the AVR you have 3 16-bit pointers and lots of general-purpose registers. Just put them into use instead of storing variables in RAM. BTW, one great thing about C, one that also gets lots of criticism, is pointers.

Like the situation with any spoken language, Forth will be incomprehensible to someone who does not speak that language. Other languages may seem more comprehensible because of similarities to the one(s) you already speak. For example, I speak Spanish and some French, so I understand much of what is said when I hear Portuguese. (Actually I learned to read and write in Spanish before I did in English.)

Forth is a totally different approach from C, and it opens up a lot of possibilities you don't have in C. It had the tools for object-oriented programming even before it was called that. The compiler itself is open to the user, such that you can define and re-define virtually anything -- even operators. Developing bug-free code is far quicker in Forth than in C. When I found Forth, I was so happy with it that I felt like I was done collecting languages. I did not want to use punctuation-laden algebraic languages anymore. For the things I initially thought Forth was a bit handicapped in, I gradually found that it not only could do them, but that it offered very elegant solutions that do not initially meet the eye. C's pointers are a lot more limiting than what Forth gives you.

As for moving arrays, the 65816 has a pair of block-move instructions which take several cycles to set up and then 7 cycles per byte thereafter; so, for example, you can move 10,000 bytes in 70,000 clocks. Part of the reason it takes 7 clocks per byte is that the operation is interruptible, so the 65 family's outstanding interrupt-response speed is preserved. No RAM locations are used in the control of the block move.


PostPosted: Fri Feb 11, 2011 4:53 am

Joined: Thu May 28, 2009 9:46 pm
Posts: 8395
Location: Midwestern USA
GARTHWILSON wrote:
As for moving arrays, the 65816 has a pair of block-move instructions which take several cycles to set up and then 7 cycles per byte thereafter; so, for example, you can move 10,000 bytes in 70,000 clocks. Part of the reason it takes 7 clocks per byte is that the operation is interruptible, so the 65 family's outstanding interrupt-response speed is preserved. No RAM locations are used in the control of the block move.

And if you're really in a hurry, kill IRQs while MVN/MVP are doing their thing.

BTW, in the M/L monitor I wrote for my POC unit, I use MVN/MVP to do the grunt work of the T (transfer memory) operation. The performance is really impressive, especially now that I accidentally discovered I can run the thing with a 15 MHz Ø2 clock. :)

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Fri Feb 11, 2011 8:25 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
jesari wrote:
I don't know much about Forth, but it looks like a quite unreadable language.


That's because you don't know Forth. I apologize if this sounds pompous, but I'm not trying to be; I'm simply stating a fact. Forth legibility comes with experienced use of the language, just as Chinese legibility comes with experienced use of that language.

Quote:
I assume you are just moving data from one array into another,


Yes and no. Yes, I'm moving data. No, that's not my primary objective.

Moving data is trivially easy. Any 4004 can do it. But that doesn't mean the 4004 has an ideal instruction set for use with an HLL.

I want the CPU to have to work at least a little bit, or else what's the point of my research?

Quote:
and this can be done much faster on an AVR than on any 6502:
Code:
   ldd R0,Y+3
   st  Z,R0

4 bytes, 4 cycles (for 8-bit transfer)
In the AVR you have 3 16-bit pointers and lots of general purpose registers.


Those pointer registers need to have their addresses computed. Your code completely ignores that, while mine is predicated on it.

Quote:
Just put them into use instead of storing variables in RAM.


"Just," he says. ;)

First, this is, in general, an NP-complete problem (optimal register allocation), and no compiler known to man today solves it in polynomial time. Consequently, registers are best treated as a cache for RAM-resident values. This not only makes the problem tractable, it is also plenty fast enough for nearly all types of applications written in nearly all types of languages.

Second, by using registers as a cache for values in RAM, mapping them to stack positions within basic blocks, you achieve precisely what you recommend -- keeping the most relevant values in registers as they're needed. Between basic blocks, however, you will need to flush and resync the stack image back into RAM (just as C has to). A rough sketch of this appears after the third point below.

Third, you're making a pretty big assumption about the i and j variables by treating them as procedure-local variables, while my code makes no such assumptions. In fact, when I wrote the experiment, i and j were, in fact, globals.
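
Here is the sketch promised under the second point (Python, purely illustrative; RegCache and its methods are invented for this post and are not part of my tool): registers act as a write-back cache over RAM-resident values, and the cache is flushed at a basic-block boundary so memory becomes the truth again.
Code:
class RegCache:
    """Registers treated as a write-back cache over RAM-resident values."""
    def __init__(self, regs=("A", "X", "Y")):
        self.free = list(regs)
        self.reg_for = {}             # variable name -> register caching it
        self.dirty = set()

    def read(self, name, emit):
        """Return the register holding 'name', loading it on a miss."""
        if name not in self.reg_for:
            reg = self.free.pop(0)    # sketch: assume a register is available
            emit(f"LD{reg} {name}")
            self.reg_for[name] = reg
        return self.reg_for[name]

    def written(self, name):
        """The cached copy was just changed; defer the store."""
        self.dirty.add(name)

    def flush(self, emit):
        """At a basic-block boundary, make RAM the truth again."""
        for name in sorted(self.dirty):
            emit(f"ST{self.reg_for[name]} {name}")
        self.free += list(self.reg_for.values())
        self.reg_for.clear()
        self.dirty.clear()

code = []
cache = RegCache()
cache.read("i", code.append)          # emits "LDA i" (A is the first free register)
cache.written("i")                    # i was modified in the register; no store yet
cache.flush(code.append)              # block boundary: emits "STA i"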

Quote:
One great thing about C, one that also gets lots of criticism, is pointers.


That's because pointers are a major source of errors in real-world programming. Hate to say it, but it's true. I'm a professional QA and platform test engineer, and have been doing this for much more than a decade. Trust me on this.

(Footnote: Oberon, Modula-2, and Pascal all have pointers as well; however, nobody seems to complain nearly as much about them. This comes largely from these languages' much, much, much stricter type checking.)

It's also a non sequitur. I'm not here to make a language that doesn't use pointers. In fact, array notation is pointer notation by any other name. And, in Forth, pointers are all you have: @ is the "fetch" operator in Forth.

Again, and as stated several times earlier in the thread, the goal of the code is not to be super-slick. It's not to be the most efficient compiler for an existing architecture possible. It's not even intended to compete. It should be self-evident that a one-man proof-of-concept can never compete with a compiler toolchain with more than a decade of development by hundreds of thousands of contributors.

It's intended to explore the close relationship between both code generation and instruction set architecture. Nothing more. I believe I've already made this patent, above.


PostPosted: Fri Feb 11, 2011 3:27 pm

Joined: Wed Oct 27, 2010 1:28 pm
Posts: 25
Location: Valladolid (Spain)
kc5tja wrote:
Those pointer registers need to have their addresses computed. Your code completely ignores that, while mine is predicated on it.

kc5tja wrote:
Third, you're making a pretty big assumption about the i and j variables by treating them as procedure-local variables, while my code makes no such assumptions. In fact, when I wrote the experiment, i and j were, in fact, globals.


So, now I understand that a, b, i, and j are addresses. I still think it is better to have pointers instead of indices, and variables stored in registers instead of RAM, but let's analyze it this way.
I must say your AVR/65816 comparison is not fair. The AVR has to deal with 16-bit values and addresses, while the 65xx has everything as 8-bit (and still it spends about the same amount of time!!). Try to address more than 256 bytes of memory on 65xx processors and you'll find it is a pain in the neck. The 65xx will perform much worse in that case.
The AVR code you posted can be further optimized. Arlet showed that the GCC compiler can reduce the code size by 10 bytes. GCC isn't too smart: you can remove the two MOV instructions by doing the arithmetic on the pointer registers (r28:r29) directly. In the end the AVR code is only 40 bytes and it executes in 24 cycles. Now, try to do the same, I mean using 16-bit values and addresses, on a 65xx processor.

kc5tja wrote:
First, this is, in general, an NP-complete problem (optimal register allocation), and no compiler known to man today solves it in polynomial time.

I don't know what you are talking about. a[i]:=b[j+3]; surely is a solvable problem for any compiler...
Quote:
That's because pointers are a major source of errors in real-world programming. Hate to say it, but it's true. I'm a professional QA and platform test engineer, and have been doing this for much more than a decade. Trust me on this.

I don't think this argument can be used as an excuse to "forbid" them. But I also understand that these days software programmers are always in a rush to get something new onto the market, and they don't care much about the efficiency of the code (with huge amounts of memory to waste, and multi-core super-scalar processors with 3 GHz clocks, the user won't notice it), while, on the other hand, having a bug-free program is still mandatory (or should it be?).

And, BTW, I don't know Forth. I never went further than "2 3 + .". I must admit I don't like stack-based languages. But, analyzing the assembler code you posted, I gather that "cells" means "multiply by word size", "swap" exchanges the two values on top of the stack, and "!" writes something back. All this mess instead of "a[i]=b[j+3]". Do you understand why I don't like Forth?


PostPosted: Fri Feb 11, 2011 4:39 pm

Joined: Thu May 28, 2009 9:46 pm
Posts: 8395
Location: Midwestern USA
jesari wrote:
I must say your AVR/65816 comparison is not fair. The AVR has to deal with 16-bit values and addresses, while the 65xx has everything as 8-bit (and still it spends about the same amount of time!!). Try to address more than 256 bytes of memory on 65xx processors and you'll find it is a pain in the neck. The 65xx will perform much worse in that case.

The W65C816S has selectable register widths, so the programmer can operate it in 8 bit mode where most efficacious and switch to 16 bit mode where word loads and stores would be more efficient.

Operating with 16 bit index registers, the '816 can address 64K of contiguous RAM with simple indexing. That is, if I code:
Code:
REP #%00110000          ;16 bit .A, .X & .Y
LDA #$12AB
STA SOMEWHERE,X

or:
Code:
SEP #%00100000          ;8 bit .A
REP #%00010000          ;16 bit .X & .Y
LDA #$12
STA SOMEWHERE,X

I'm able to store anywhere from SOMEWHERE to SOMEWHERE+$FFFF. The "penalty" of a 16 bit load or store vs. 8 bit is an extra clock cycle. Hardly a performance hit, or a cause of discomfort to any part of the anatomy. :)

I agree working beyond the range of one page with the 8 bit members of the 65xx family is more involved. Also, a 16 bit load or store has to be broken down to a sequence of 8 bit operations (they are 8 bit processors, after all). However, it's no big deal to an experienced programmer who knows the 65xx assembly language.

BTW, at 20 MHz, the '816 can manage between 2.8 million and 10 million instructions per second. Also, its interrupt latency is very low, which is a critical matter in any real time system. Can the AVR exceed that?

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!

