6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Mon Jun 03, 2024 9:54 pm

All times are UTC




Post new topic Reply to topic  [ 37 posts ]  Go to page 1, 2, 3  Next
Author Message
PostPosted: Wed Apr 22, 2015 5:19 pm 
Offline

Joined: Sat Aug 21, 2010 7:52 am
Posts: 231
Location: Arlington VA
On another thread, Mike B writes
barrym95838 wrote:
Dr Jefyll wrote:
... "TOS and TOS+1 are at fixed points in ZP." TOS resides someplace other than on stack??? I guess this relates to the split stack you're using. TOS is treated differently than the other stack items so it can be used as an indirect pointer -- is that right?

That's Charlie's deal with his Pettil implementation. It involves some trade-offs. A favorable example: NIP becomes INX. There are pros and cons to any decision like that. For my 65m32 DTC Forth, I keep TOS in the accumulator, but that would be impossible for a 6502, and a dubious optimization for an '802/'816. Perhaps Charlie did some benchmarks that led to his design decision, and would be able to share them briefly with us, here or in a fresh thread?

The idea came from David Wheeler's 6502 page and also here.

Garth has gently suggested that the split stack might not be very good.
GARTHWILSON wrote:
It would take some convincing for me to believe that splitting the stack and having to keep moving things in and out of a fixed-address TOS pair of ZP bytes would be more efficient than keeping byte pairs together in every cell and doing indexing and doing the INX or DEX twice to add or drop stack cells.
There are two 48-byte regions of zero page called 'stackl' and 'stackh', with 'tos' being stored in a separate 2-bytes contiguous region of zero page. The X register is the stack pointer Here are DUP and DROP
Code:
DUP
    dex               ; [2]
    lda tos+1         ; [3]
    sta stackh,x      ; [4]
    lda tos           ; [3]
    sta stackl,x      ; [4]
    jmp next          ; [3]
                      ; [19]
DROP
    lda stackh,x      ; [4]
    sta tos+1         ; [3]
    lda stackl,x      ; [4]
    sta tos           ; [3]
    inx               ; [2]
    jmp next          ; [3]
                      ; [19]

FIG DUP = [23] = 2 + 2 + 4 + 4 + 4 + 4 + 3
FIG DROP = [7] = 2 + 2 + 3


Last edited by chitselb on Wed Apr 22, 2015 7:53 pm, edited 2 times in total.

Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 22, 2015 7:45 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8453
Location: Southern California
The normal ones (with bytes together) is:
Code:
DUP
    DEX
    DEX
    LDA  3,X
    STA  1,X
    LDA  2,X
    STA  0,X
    JMP  NEXT   ; FIG's DUP is less efficient, having an extra PHA and PLA, and has
                ; to make NEXT come immediately after PUSH to avoid an extra JMP.

DROP
    INX
    INX
    JMP  NEXT   ; (or in the case of STC, just straightline it with only the two INX's (2 bytes & 4 clocks total), with no JMP)

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 22, 2015 8:02 pm 
Offline

Joined: Sat Aug 21, 2010 7:52 am
Posts: 231
Location: Arlington VA
Another way to do it would be splitting the stack without a separate 16-bit TOS. As with FIG, words like @ or ! require a copy of tos before using (ZP),Y addressing
Code:
DUP
    dex               ; [2]
    lda stackl+1,x    ; [4]
    sta stackl,x      ; [4]
    lda stackh+1,x    ; [4]
    sta stackh,x      ; [4]
    jmp next          ; [3]
                      ; [21]
DROP
    inx               ; [2]
    jmp next          ; [3]
                      ; [5]


Top
 Profile  
Reply with quote  
PostPosted: Thu Apr 23, 2015 2:50 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1933
Location: Sacramento, CA, USA
I can't say that I've analyzed every permutation, but when it comes to the good old NMOS 6502, I think that perhaps it would be best to concentrate on which words get executed the most often and optimize them as best you can.

Dr. Brad dropped an interesting little tidbit here:
Quote:
For reference, Phil Koopman did a study of how often Forth primitives get called in some benchmark Forth programs. The top ten were CALL a.k.a. ENTER (12.21%), EXIT (11.74%), VARIABLE (5.46%), @ (5.40%), 0BRANCH (4.78%), LIT (4.54%), + (4.18%), SWAP (3.90%), R> (3.89%), >R (3.87%), CONSTANT (3.68%), and DUP (3.05%). Note that CALL, EXIT, R>, and >R combined add up to about 32% of primitive executions.


I'll forgive him for listing twelve words in the top ten, and also mention that NEXT can be a big cycle hog, depending on implementation and host architecture (free for STC, significantly less free for others, but to varying degrees). I think that Pettil is a strong performer in the NMOS 6502 crowd because of its streamlined NEXT, not because of its split stack and separate TOS, but I am most certainly not a expert in that area, so I'll just leave it at a thought. It's a shame that NIP isn't on that list, though, because Pettil crushes the competition for that word.

A parting thought would be to spend some serious time and energy on coding an optimal ENTER and EXIT (unless we're talking STC, in which case they're just JSR and RTS).

Mike B.


Top
 Profile  
Reply with quote  
PostPosted: Thu Apr 23, 2015 3:05 pm 
Offline

Joined: Tue Jan 07, 2014 8:40 am
Posts: 91
Oops. I don't know how I miscounted that. :)

On STC Forths that don't do inline expansion, the function of NEXT (chaining from one machine language primitive to another) is performed by RTS/JSR, so it's not free. On CPUs that can do a single-instruction NEXT, STC can actually be slower.

_________________
Because there are never enough Forth implementations: http://www.camelforth.com


Top
 Profile  
Reply with quote  
PostPosted: Thu Apr 23, 2015 5:01 pm 
Offline

Joined: Sat Aug 21, 2010 7:52 am
Posts: 231
Location: Arlington VA
VARIABLE is third? really? I'm also surprised CONSTANT makes this list


Top
 Profile  
Reply with quote  
PostPosted: Thu Apr 23, 2015 5:58 pm 
Offline

Joined: Tue Jan 07, 2014 8:40 am
Posts: 91
The key phrase is "some benchmark Forth programs." Koopman tested a fractal landscape generation program, Conway's game of Life, a floating-point package written in high-level Forth, and the Forth compiler itself. VARIABLE was used almost as much as CALL and EXIT in the Life program. In the Forth compiler, VARIABLE didn't make the top twelve. The results I quoted were the average across all four test programs.

_________________
Because there are never enough Forth implementations: http://www.camelforth.com


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 24, 2015 2:02 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1933
Location: Sacramento, CA, USA
Brad R wrote:
Oops. I don't know how I miscounted that. :)

On STC Forths that don't do inline expansion, the function of NEXT (chaining from one machine language primitive to another) is performed by RTS/JSR, so it's not free. On CPUs that can do a single-instruction NEXT, STC can actually be slower.

Now it's my turn to say "oops". I have read your Moving Forth series enough times to know better, but maybe I'm due for another round.

I have really become enamored with DTC ... it seems to have the right "feel" in the simple work that I've been doing so far. I am not nearly advanced enough to tackle the subjects of optimizing compilers and other more modern techniques, but I wouldn't mind peeking at a few links, if you have any available.

Thanks,

Mike B.


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 24, 2015 2:16 am 
Offline

Joined: Sun Jul 28, 2013 12:59 am
Posts: 235
Brad R wrote:
Oops. I don't know how I miscounted that. :)

On STC Forths that don't do inline expansion, the function of NEXT (chaining from one machine language primitive to another) is performed by RTS/JSR, so it's not free. On CPUs that can do a single-instruction NEXT, STC can actually be slower.


So, here's a frightening thought: Using KimKlone-like techniques, it should be possible to maintain and manipulate the "effective" program counter outside of the CPU, basically using a microcode rom to indicate, based on sync and latched opcode, which bytes need to be fetched from the instruction stream, completely ignoring the address provided by the CPU. Taken branches involve an extra instruction before sync, so even that can be dealt with. Next, use a 16-bit data bus and a MUX for memory access by the CPU. Add another 16-bit pointer for use as a Forth instruction pointer. Allow the microcode to kick off a 16-bit aligned read to load into the instruction pointer in a single cycle. Finally, use instruction substitution or similar to set up a two-cycle instruction that will load the externalized CPU instruction pointer from the memory pointed to by the Forth instruction pointer, also forcing a pre- or post-increment.

It's not quite a one-cycle NEXT, as it takes two cycles, but it should still be a win over RTS/JSR. Requires DTC, though. ITC would be three cycles with this sort of scheme.

(edit: swapped DTC and ITC in the context of cycle counts.)


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 24, 2015 12:52 pm 
Offline

Joined: Mon Jan 26, 2015 6:19 am
Posts: 85
Had I been aware of the concept of a split stack at the time, I would definitely have incorporated it into my forth. However, it would be low on my list of priorities to do now. Since adding or removing numbers from the stack involves only one DEX or INX instead of two (saving 2 clock cycles each time) there is a saving to be had but not enough to justify the effort of getting it to work right in an existing forth.

Having a separate TOS is useful in processors like the x86 where you can dedicate a register solely for that purpose - especially if you can reverse the roles of the data and return stack pointers so that refilling the TOS becomes a simple POP operation.

On a 6502 there is not so much of an advantage. Even the fact that the TOS would be a pointer in its own right is only a marginal advantage. There is definitely an advantage when performing the @ operation but not for the corresponding ! operation because like most forth primitives, the TOS has to be refilled each time the word is executed.

In spite of the "research" listed above, I believe that after NEXT, the words that need the most time saved would be DUP and DROP. The DUP operation would be speeded up (very) slightly with a separate TOS but the DROP operation would be slowed down significantly as shown in the OP. In an STC forth, the difference would be even greater because without a separate TOS, DROP could be coded in-line as INX (or INX INX) with no JSR/RTS overhead.


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 24, 2015 5:05 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1933
Location: Sacramento, CA, USA
theGSman wrote:
... Having a separate TOS is useful in processors like the x86 where you can dedicate a register solely for that purpose - especially if you can reverse the roles of the data and return stack pointers so that refilling the TOS becomes a simple POP operation ...

I like all of your points, especially this one. It is a huge advantage in my custom hobby processor design, because I have just enough registers to dedicate the accumulator to TOS, I have auto-increment and auto-decrement, the accumulator can be used as a pointer without penalty, and pushes to the R stack can be combined with a post-load to the same register with no code space penalty. Now I just need to get the dang thing working ...

Mike B.


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 24, 2015 5:26 pm 
Offline

Joined: Tue Jan 07, 2014 8:40 am
Posts: 91
barrym95838 wrote:
I am not nearly advanced enough to tackle the subjects of optimizing compilers and other more modern techniques, but I wouldn't mind peeking at a few links, if you have any available.


Alas, I don't. Many years ago my friend Charles Curley wrote a book on Forth for the 68000 -- never published; I reviewed a draft -- which had a chapter on some optimization techniques (inline expansion, if I recall correctly). Other than that, what I've learned I have picked up from hints dropped in various conference papers. I know there has been some very good work done on optimizing Forth compilers, but it tends to be proprietary.

_________________
Because there are never enough Forth implementations: http://www.camelforth.com


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 24, 2015 5:28 pm 
Offline

Joined: Tue Jan 07, 2014 8:40 am
Posts: 91
nyef wrote:
So, here's a frightening thought: Using KimKlone-like techniques, it should be possible to maintain and manipulate the "effective" program counter outside of the CPU, basically using a microcode rom to indicate, based on sync and latched opcode, which bytes need to be fetched from the instruction stream, completely ignoring the address provided by the CPU. ...


Yes, but if you're going to do that, why not just design a Forth CPU?

_________________
Because there are never enough Forth implementations: http://www.camelforth.com


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 24, 2015 5:47 pm 
Offline

Joined: Sun Jul 28, 2013 12:59 am
Posts: 235
Brad R wrote:
nyef wrote:
So, here's a frightening thought: Using KimKlone-like techniques, it should be possible to maintain and manipulate the "effective" program counter outside of the CPU, basically using a microcode rom to indicate, based on sync and latched opcode, which bytes need to be fetched from the instruction stream, completely ignoring the address provided by the CPU. ...


Yes, but if you're going to do that, why not just design a Forth CPU?


Because then it would be off-topic for 6502.org?


Top
 Profile  
Reply with quote  
PostPosted: Fri Apr 24, 2015 11:34 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1933
Location: Sacramento, CA, USA
nyef wrote:
Because then it would be off-topic for 6502.org?

I can't wait to see how it turns out ... I hope that you name it the 6502.Borg!!

Locutus B.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 37 posts ]  Go to page 1, 2, 3  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 6 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: