6502.org Forum

All times are UTC




Post new topic Reply to topic  [ 15 posts ] 
Author Message
 Post subject: A slow jonesforth port
PostPosted: Sat Sep 30, 2023 5:19 pm 

Joined: Tue Sep 26, 2023 11:09 am
Posts: 109
Someone trapped me in the Forth rabbit hole via this article (reader beware: https://ratfactor.com/forth/the_program ... tself.html )

This ended up with me writing a 16-bit 6502-based port of jonesforth. It's super slow (I recently discovered TaliForth is about 3x faster) but has been a great learning experience, and led to some interesting profiling visualization. Work in progress is at https://github.com/patricksurry/forth-6502

One of the main sources of slowness is the overhead of 16-bit pointer manipulation. For example, I think I went too general in using 16-bit data and return stack pointers where a single page each would likely be enough. What do people typically use for stack size in a 16-bit Forth?

A question that's been bugging me: since the 65(C)02 only has indirect addressing indexed by Y, is there any reasonably efficient way to copy between two offsets of the same indirect address? i.e. something similar to
Code:
lda (SP),y
followed by
Code:
sta (SP),x
but obviously that's not legal. All I can think of is copying the indirect address into some self-modifying code that uses absolute indexing, but that seems pretty icky.
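Here's roughly what that self-modifying variant would look like (a sketch only: SRCOFF, DSTOFF and the patch label are made-up names, and it assumes SP is a zero-page pointer and the code lives in RAM):

Code:
        lda SP          ; copy the 16-bit pointer value...
        sta patch+1     ; ...into the operand of the STA below
        lda SP+1
        sta patch+2
        ldy #SRCOFF
        lda (SP),y      ; read from SP+SRCOFF
        ldy #DSTOFF
patch:  sta $FFFF,y     ; operand patched to SP, so this writes SP+DSTOFF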


PostPosted: Sat Sep 30, 2023 8:22 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Welcome.

pdragon wrote:
I think I went too general in using 16-bit data and return stack pointers where a single page each would likely be enough.  What do people typically use for stack size in 16-bit forth?

This is from page 16 of my treatise on 6502 stacks:

    When you know you're accessing the stacks constantly but don't know what the maximum depth is you're using, the tendency is to go overboard and keep upping your estimate, "just to be sure."

    I did this for years myself, and finally decided to do some tests to find out.  I filled the 6502 stack area with a constant value (maybe it was 00—I don't remember), ran a heavy-ish application with all the interrupts going too, did compiling, assembling, and interpreting while running other things in the background on interrupts, and after a while looked to see how much of the stack area had been written on.  It wasn't really much—less than 20% of each of page 1 (return stack) and page 0 (data stack).  This was in Forth, which makes heavy use of the stacks.  The IRQ interrupt handlers were in Forth too, although the software RTC (run off a timer on NMI) was in assembly language.
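    The test itself can be sketched like this (illustrative only; the sentinel value is arbitrary and the labels are made up):

Code:
        ldx #0
        lda #$AA        ; arbitrary sentinel value
fill:   sta $0100,x     ; fill the hardware-stack page
        inx
        bne fill
        ; ... run the workload for a while, then scan upward
        ; from the bottom (the stack grows down from $01FF):
        ldx #0
scan:   lda $0100,x
        cmp #$AA        ; still the sentinel?
        bne done        ; first overwritten byte = deepest point reached
        inx
        bne scan
done:                   ; X now holds the low-water mark of stack usage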

Quote:
A question that's been bugging me: since the 65(C)02 only has indirect addressing indexed by y, is there any reasonably efficient way to copy between two offsets of the same indirect address?  i.e. something which would be similar to
Code:
lda (SP),y
followed by
Code:
sta (SP), x
but obviously that's not legal. All I can think of is copying the indirect address into some self-modifying code that uses absolute indexing, but that seems pretty icky.

The values will typically be on the data stack in page 0.  If you need to copy a range of addresses, as in CMOVE or CMOVE>, you move them into N, which is a non-indexed part of page 0: about 8 bytes of variable space that is only used inside primitives and is never used to pass data between primitives.  When a primitive finishes, whatever is in N becomes irrelevant.  Putting things there removes the extra cycle of overhead of constantly indexing by X, as you have to do in the ZP data stack.  BTW, the 65c02 does indeed have indirect addressing without using an index register, so for example you can do LDA (ZP).
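As a sketch of the idea (labels are illustrative; N is the scratch area described above, and count is a hypothetical zero-page byte holding the number of bytes to move):

Code:
        lda 0,x         ; destination address (TOS) into N
        sta N
        lda 1,x
        sta N+1
        lda 2,x         ; source address (NOS) into N+2
        sta N+2
        lda 3,x
        sta N+3
        ldy #0
loop:   lda (N+2),y     ; plain (ZP),Y -- no X-indexing overhead
        sta (N),y
        iny
        cpy count       ; hypothetical count of bytes to move
        bne loop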

You have several things that prompt me to recommend the 6502 stacks treatise which addresses lots of considerations and shows unexpected possibilities.  It's not Forth-specific, but shows how to do things in a very Forth-like way even in assembly language.  It's 19 pages plus appendices.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Sat Sep 30, 2023 11:24 pm 

Joined: Tue Sep 26, 2023 11:09 am
Posts: 109
Nice - thanks!


PostPosted: Sun Oct 01, 2023 6:11 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
In my BASIC interpreter, which targets the NMOS 6502, I use three tricks to handle this sort of problem:

1: You can set up two zero-page pointers with the required offset between them. A single Y post-index value can then be used with both.

2: You can set X to zero, and use increments of a zero-page address for counting instead of incrementing X (the status flags are set the same way). You can then use pre-indexed indirect as if it were straight indirect. This would be one of the first things to optimise when specialising for 65C02, though.

3: If you use an index register to track the top of a stack with a fixed base, you can perform stack-relative accesses by altering the base address. This works for zero-page as well as absolute indexing because zero-page indexing wraps within zero page, so you can specify two's-complement negative base addresses. Both Forth and my BASIC interpreter use a zero-page operand stack, which can be manipulated conveniently in this way.
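A sketch of trick 1 (SP, DST and OFFSET are illustrative names):

Code:
        lda SP          ; set up DST = SP + OFFSET, once
        clc
        adc #OFFSET
        sta DST
        lda SP+1
        adc #0          ; propagate carry into the high byte
        sta DST+1
        ldy #0
        lda (SP),y      ; one Y value now serves both pointers:
        sta (DST),y     ; reads SP+Y, writes SP+OFFSET+Y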


PostPosted: Sun Oct 01, 2023 3:35 pm 

Joined: Tue Sep 26, 2023 11:09 am
Posts: 109
ya I think restricting my stacks to a single page each is probably the next thing to do, so I can use an index register as a stack pointer without having to indirect via a 16-bit stack pointer


PostPosted: Sun Oct 01, 2023 4:18 pm 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Chromatix wrote:
You can set X to zero, and use increments of a zero-page address for counting instead of incrementing X (the status flags are set the same way). You can then use pre-indexed indirect as if it were straight indirect. This would be one of the first things to optimise when specialising for 65C02, though.

I did this in VTL02 to open and close gaps of 128 bytes or less in the program text area while inserting and deleting lines. My obsession with trimming every byte of fat from VTLC02 has gradually shrunk it down to 952 bytes (VTL02C is about 1020), prompting me to consider investing a few bytes here and there to improve performance. An improved PRNG (from Arlet) and a smarter line editor that only moves the tail of the program text once while replacing existing lines are near the top of the to-do list. I also want to look into porting the CMOS code back to NMOS, since my recent improvements haven't all been about CMOS instructions and addressing modes. CMOS instructions can still be a deal-breaker for some hobbyists, and I don't want to exclude anyone (not even the ceramic-package guys with no ROR) if I can help it. :)

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


PostPosted: Tue Oct 10, 2023 9:53 pm 

Joined: Wed Aug 21, 2019 6:10 pm
Posts: 217
pdragon wrote:
ya I think restricting my stacks to a single page each is probably the next thing to do, so I can use an index register as a stack pointer without having to indirect via a 16-bit stack pointer


A single page for both may well suffice, with one growing down from the top and the other growing up from the bottom. >R and R> will, for example, then never overflow.
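A sketch of the collision check that then covers both stacks at once (assuming the data stack is indexed by X growing down from the top of the page and the return stack index RP grows up from the bottom; names are illustrative, and the equality boundary case is left to taste):

Code:
        cpx RP          ; data-stack index still above the return-stack top?
        bcc collide     ; carry clear means X < RP: the stacks have met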

Note that if you are using X-indexed absolute addressing for the data stack / return stack, you can split it into parallel byte stacks (one page of low bytes, one of high bytes) to save one X increment / decrement, e.g.:

Code:
PLUS: ; + ( n1 n2 -- n1+n2 )
LDA TL,X ; low byte of TOS
CLC
ADC TL+1,X ; add low byte of NOS
STA TL+1,X
LDA TH,X ; high byte of TOS
ADC TH+1,X ; add high bytes with the carry from the low bytes
STA TH+1,X
INX ; drop one cell
JMP NEXT


PostPosted: Tue Oct 10, 2023 10:03 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
The data stack kind of has to be in ZP though, because so many cells will be addresses that you'll need to use to fetch/store from/to, with the (ZP,X) addressing mode.  You could keep the data stack in another page and copy addresses to a ZP location (like N mentioned above) for indirection, but I think the penalty in performance and required memory would make it not worth it.



PostPosted: Wed Oct 11, 2023 11:49 am 

Joined: Wed Aug 21, 2019 6:10 pm
Posts: 217
GARTHWILSON wrote:
The data stack kind of has to be in ZP though, because so many cells will be addresses that you'll need to use to fetch/store from/to, with the (ZP,X) addressing mode.  You could keep the data stack in another page and copy addresses to a ZP location (like N mentioned above) for indirection, but I think the penalty in performance and required memory would make it not worth it.


A net four bytes of extra memory usage, because you have two utility vectors in zero page plus the stack somewhere in general memory rather than the stack in zero page, is a negligible memory penalty. And a two-cycle-faster DROP, typically the most frequently executed primitive, pays off some of the net clock penalty that memory-access ops incur overall.

And for memory-access primitive execution speed there are trade-offs, since you get to use (ZP) and (ZP),Y to fetch and store, which are both normally one cycle faster than (ZP,X)*. That also reduces the 16-bit access penalty: "INC D,X : BNE + : INC D+1,X : + ..." costs 8 clocks (more on crossing a page boundary)*, as opposed to 2 for "LDY #1" or "INY". And for fetches beyond C@, the "TAY" needed to postpone storing the first fetched byte (because the vector is still sitting on the stack) costs another two. So the 14 cycles to move a memory vector is not normally a net 14-cycle cost; it's more like a net 0-2 cycles on 16-bit memory ops. For 8-bit memory ops it will be a bigger net cost, but OTOH for the double memory ops it's probably a net gain:

Code:
TWOFETCH: ; 2@ ( a-addr -- x1 x2 )
LDA TL,X ; costs clocks
STA W ; costs clocks
LDA TH,X ; costs clocks
STA W+1 ; costs clocks
DEX
LDA (W) ; saves 1 clock*
STA TL,X
LDY #1 ; saves clocks on T,X/T+1,X increment
LDA (W),Y ; saves 1 clock*
STA TH,X
INY ; saves clocks on T,X/T+1,X increment
LDA (W),Y ; saves 1 clock*
STA TL+1,X ; saves clocks on preserving (T,X)
INY ; saves clocks on T,X/T+1,X increment
LDA (W),Y ; saves one clock*
STA TH+1,X
JMP NEXT

__________________
{* Note: if the (W),Y crosses a page boundary, because the low byte of the base address is $FD to $FF, there is no clock saving in the post-indexed versus pre-indexed address mode, but on the other hand, the "INC T,X : BNE + : INC T+1,X" skips the branch once at a cost of 4 clock cycles, so it just juggles around where the clock cycles are saved.}


PostPosted: Wed Oct 11, 2023 8:37 pm 

Joined: Sun May 13, 2018 5:49 pm
Posts: 255
First of all, congratulations. As a member of the "oops.. I fell into a rabbit hole" club myself, I think the journey is the most rewarding part.
pdragon wrote:
One of the main sources of slowness comes from too much overhead in 16-bit pointer manipulation.
For example I think I went too general in using 16-bit data and return stack pointers where a single page each would
likely be enough. What do people typically use for stack size in 16bit forth?
TaliForth2 has a data stack depth of 33 16-bit cells (66 bytes) that lives in zero page, making it easy and fast to address with the (ZP,X)-based addressing Garth describes. X always holds the top of the stack (the stack grows downward in zero page, so X gets lower as you push), so the TOS (Top Of Stack) cell is always at 0,X and 1,X and the NOS (Next On Stack) is at 2,X and 3,X. Here's the code for DUP to show how it works (Tali2 is an STC Forth):
Code:
; ## DUP ( u -- u u ) "Duplicate TOS"
; ## "dup"  auto  ANS core
        ; """https://forth-standard.org/standard/core/DUP"""
xt_dup:
                jsr underflow_1

                dex
                dex

                lda 2,x         ; LSB
                sta 0,x
                lda 3,x         ; MSB
                sta 1,x

z_dup:          rts
The underflow_1 routine makes sure there is at least 1 item on the stack.
The two DEX instructions make room on the data stack (which grows downward, so we decrement X to make room on the "top" of the stack). Then the bytes are copied into the new location; because we've already grown the stack, the value to be copied is at 2,X and 3,X.

I've written Forth for reading FAT32-formatted compact flash cards and generally don't use more than 1/4 of the data stack. Tali's data stack depth is technically adjustable, but one of Scot's original design goals was to leave half of zero page free for the user's use.

The return stack uses the 65C02 hardware return stack in page 1 ($0100-01FF), and I doubt I've used more than a quarter of that as well.
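With page 1 serving as the return stack, >R reduces to a couple of pushes. A minimal sketch, assuming the word is compiled inline so that no JSR return address is in the way on the hardware stack:

Code:
        lda 1,x         ; MSB of TOS...
        pha             ; ...onto the hardware return stack
        lda 0,x         ; then the LSB
        pha
        inx             ; drop the cell from the data stack
        inx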


PostPosted: Thu Oct 12, 2023 3:14 pm 

Joined: Wed Aug 21, 2019 6:10 pm
Posts: 217
SamCoVT wrote:
Code:
; ## DUP ( u -- u u ) "Duplicate TOS"
; ## "dup"  auto  ANS core
        ; """https://forth-standard.org/standard/core/DUP"""
xt_dup:
                jsr underflow_1

                dex
                dex

                lda 2,x         ; LSB
                sta 0,x
                lda 3,x         ; MSB
                sta 1,x

z_dup:          rts
The underflow_1 routine makes sure there is at least 1 item on the stack. ...


Is it possible to do a slight tweak to check underflow with a branch?

So you have 66 slots, from the bottom of the stack area to the top of the stack area, for a stack growing down:
BOSA: [#66L][#66H][#65L][#65H]...[#2L][#2H] TOSA: [#1L][#1H]

Now, AFTER the "DEX : DEX", X should be indexing #2 or lower. If you have underflow, it will be indexing #1 or somewhere above the stack. So if X is indexed relative to "DS", the top slot, then after X decrements, if it is not negative, you have an underflow. So:

Code:
; ## DUP ( u -- u u ) "Duplicate TOS"
; ## "dup"  auto  ANS core
        ; """https://forth-standard.org/standard/core/DUP"""
xt_dup:
                dex
                dex
                bpl + ; underflow
 
                lda DS+2,x         ; LSB
                sta DS,x
                lda DS+3,x         ; MSB
                sta DS+1,x
z_dup:          rts
+               jmp underflow


Then, in the case where you don't underflow, the underflow test costs just 2 cycles, rather than the 12 cycles of the call and return plus whatever it costs to do the test itself.


PostPosted: Fri Oct 13, 2023 3:57 pm 

Joined: Sun May 13, 2018 5:49 pm
Posts: 255
BruceRMcF wrote:
Is it possible to do a slight tweak to check underflow with a branch?
Yes, but...

Tali "natively" compiles short words (i.e. it copies the word's assembly into the word you are defining), and it has an option to strip the underflow check when doing so. There are variables you can set to control the maximum size of a word that will be natively compiled, and whether to strip the underflow checking. The JSR is always exactly 3 bytes in words that have underflow checking, so it's easy to strip off, and the size and speed penalty can be eliminated if you opt to strip the check while compiling. I usually leave underflow checking on while developing new words, but turn it off if I need the extra speed (I haven't really run into a situation where I was concerned about memory usage, as Forth is generally quite compact for a given functionality).

I do like your method though, from a general speed perspective, and it would probably be worth doing if Tali didn't have the ability/option to strip off the underflow check when native compiling.


PostPosted: Fri Oct 13, 2023 8:15 pm 

Joined: Wed Aug 21, 2019 6:10 pm
Posts: 217
SamCoVT wrote:
BruceRMcF wrote:
Is it possible to do a slight tweak to check underflow with a branch?
Yes, but...

Tali "natively" compiles short words (i.e. it copies the word's assembly into the word you are defining), and it has an option to strip the underflow check when doing so. There are variables you can set to control the maximum size of a word that will be natively compiled, and whether to strip the underflow checking. The JSR is always exactly 3 bytes in words that have underflow checking, so it's easy to strip off, and the size and speed penalty can be eliminated if you opt to strip the check while compiling. I usually leave underflow checking on while developing new words, but turn it off if I need the extra speed (I haven't really run into a situation where I was concerned about memory usage, as Forth is generally quite compact for a given functionality).

I do like your method though, from a general speed perspective, and it would probably be worth doing if Tali didn't have the ability/option to strip off the underflow check when native compiling.


Aha, that makes more sense if it can be avoided ... though I'd still be sorely tempted to trade size for the speed of pre-checking, with DS set to the first address above the stack: "CPX #0 : BMI + : JMP UNDERFLOW : + ...", stripping those seven bytes when natively compiling.

I'm used to eForth / CamelForth style overflow and underflow checking at return to the outer interpreter; indeed, the xForth for the Commander X16 that I partly finished a year or two ago was largely a port of CamelForth.


PostPosted: Thu Oct 26, 2023 5:05 pm 

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
pdragon wrote:
ya I think restricting my stacks to a single page each is probably the next thing to do, so I can use an index register as a stack pointer without having to indirect via a 16-bit stack pointer


I've written many Forth programs, and a single page of data stack coupled with a single page of return stack is plenty. In fact, half a page of zero-page data stack is usually more than enough; when that isn't enough, I usually have a bug or a bad design.


PostPosted: Fri Oct 27, 2023 4:17 pm 

Joined: Wed Aug 21, 2019 6:10 pm
Posts: 217
Martin_H wrote:
pdragon wrote:
ya I think restricting my stacks to a single page each is probably the next thing to do, so I can use an index register as a stack pointer without having to indirect via a 16-bit stack pointer


I've written many Forth programs, and a single page data stack coupled with a single page return stack is plenty. In fact, half of page zero data stack is usually more than enough. When that isn't enough I usually have a bug or bad design.


I'd strongly agree with that for a 16-bit Forth. 64 cells on the stack should be plenty, unless an implementation has floating-point words and keeps floating-point values on the data stack ... and if you have floating-point words, a separate floating-point stack is so handy that the answer is "don't keep floating-point values on your data stack".

