32-bit Forth on the 65c816 - 6502.org

Dr Jefyll: Posts: 3526; Joined: 11 Dec 2009; Location: Ontario, Canada

Post by Dr Jefyll » Tue Feb 16, 2010 3:45 am

Just throwin' this out there:

We all know the 65c816 has broken the 64K memory barrier. That opens the door to some serious, "big computer" applications.

But, assuming we prefer to code in Forth, the question arises of how an '816 Forth implementation ought to access the wide open spaces above $FFFF. For the sake of discussion I decided to volunteer my $.02 worth, accompanied by some equally valuable code snips. (Corrections & comments welcome -- I'm new to the '816.)

The Forth for my KimKlone computer dealt with the 16 MByte challenge by including @ and ! variants that used 32-bit stack items -- doubles -- to receive the long (aka Far) 24-bit addresses. I had words called FAR@ and FAR! not to mention FAR2@ FAR2! FARC@ FARC!. These were reasonably fast, but a pain to code with due to the constant intermingling of 16- and 32-bit values on stack. Also there was a lot of dictionary clutter from all the double- and mixed-precision operators that turned out to be required.

What I propose here is a different tradeoff -- friendlier to code with, and a more powerful Forth overall. There's a major clock-cycle penalty, but modern '816s are pretty quick... You'll have to choose for yourself what's best. (But address calculations will involve extended precision no matter what.)

In the '816 Forth I envisage, stack items always occupy 32 bits, and @ and ! use Long addresses and access 32-bit data. But the Dictionary resides within 64K, and tokens are 16 bit. It's probably easier to show than to explain, so here are some key code snippets I've scoped out. Take a look and see if this doesn't start to get intriguing!

CONDITIONS:
Double-indirect interpreter
Y is the Forth IP; S is the ParameterStack Ptr.; R_PTR is the ReturnStack ptr.
The P-Stack and the Dictionary reside in Bank 0.
Addresses on stack are 24-bit (padded to 32 bits).
The P-stack and R-Stack grow and shrink in 4-byte increments.

Code: Select all

FETCH:  PLX         ;adrLo in X       LS 2 bits of adr must =0 (ie, DWORD aligned)
        PLB         ;adrHi in DBR
        PHB         ;keep S word-aligned
        LDA  2,X    ;fetch hi word from DBR:X+2
        STA  1,S    ;hi word to stack
        LDA  0,X    ;fetch lo word from DBR:X
        PHA         ;lo word to stack
       _NEXT

Code: Select all

 STORE: PLX         ;adrLo in X        LS 2 bits of adr must =0 (ie, DWORD aligned)
        PLB         ;adrHi in DBR
        PHB         ;keep S word-aligned
        PLA         ;
        PLA         ;lo word from stack
        STA  0,X    ;store lo word to DBR:X
        PLA         ;hi word from stack
        STA  2,X    ;store hi word to DBR:X+2
       _NEXT

Code: Select all

  NEXT: LDX  0,Y     ;5~  fetch via IP
        INY          ;2~  bump IP
        INY          ;2~
        JMP (0,X)    ;6~  X is Forth's "W"

Code: Select all

BRANCH:  CLC          ;
         TYA          ;IP in A
         ADC  0,Y     ;add offset at (IP)
         TAY          ;IP in Y
        _NEXT
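0BRANCH:
         PLA          ;lo word of TOS
         ORA  1,S     ;OR with hi word, which now sits at 1,S
         TAX          ;park the OR result in X (W is no longer needed)
         PLA          ;drop hi word (this pull changes Z)
         TXA          ;restore Z for the full 32-bit test
         BEQ  BRANCH  ;result 0? Branch.
         INY          ;Else step
         INY          ; over offset
        _NEXT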

Code: Select all

NEST:   TYA           ;IP in A
        STA [R_PTR]   ;copy IP to R  (Bank per 3rd byte of ptr, eg: 0)
        TXY           ;copy W to Y
        INY           ;
        INY           ;W+2 will be new IP
        LDA  R_PTR    ;adjust R Ptr to point to next free loc'n
        SEC
        SBC  #4
        STA  R_PTR    ;R_Ptr indicates next free loc'n
       _NEXT

Code: Select all

UNNEST: LDA  R_PTR    ;adjust R Ptr
        CLC
        ADC  #4
        STA  R_PTR
        LDA [R_PTR]   ;copy R to A  (Bank per 3rd byte of ptr, eg: 0)
        TAY           ;copy A to "IP"
       _NEXT
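
One rough way to watch NEXT, NEST and UNNEST cooperate without an '816 on the bench is to model them in Python. This is only a sketch under assumptions that are not in the post: memory is a dict keyed by byte address, a code field holds a Python callable instead of a machine address, and the names Machine, build_demo, run and the HALT word are invented for the illustration. The return stack moves in 4-byte steps, as in the listings above.

```python
class Machine:
    """Toy double-indirect threaded interpreter (illustrative model only)."""

    def __init__(self):
        self.mem = {}        # dictionary space: tokens and code fields
        self.rstack = {}     # return-stack memory
        self.ip = 0          # Forth IP (the Y register in the listings)
        self.w = 0           # Forth W  (the X register in the listings)
        self.r_ptr = 0x1FC   # next free return-stack location (assumed value)
        self.trace = []
        self.running = True

    def next(self):
        self.w = self.mem[self.ip]   # LDX 0,Y   fetch token via IP
        self.ip += 2                 # INY INY   tokens are 16 bits wide
        self.mem[self.w](self)       # JMP (0,X) jump through the code field

    def nest(self):
        self.rstack[self.r_ptr] = self.ip   # STA [R_PTR]  save caller's IP
        self.ip = self.w + 2                # body starts after the code field
        self.r_ptr -= 4                     # entries are 4 bytes apart

    def unnest(self):
        self.r_ptr += 4                     # back to the saved entry
        self.ip = self.rstack[self.r_ptr]   # LDA [R_PTR] / TAY


def build_demo():
    m = Machine()
    m.mem[0x100] = lambda m: m.trace.append("A")            # primitive A
    m.mem[0x110] = lambda m: m.trace.append("B")            # primitive B
    m.mem[0x120] = Machine.unnest                           # EXIT
    m.mem[0x130] = lambda m: setattr(m, "running", False)   # HALT (hypothetical)
    # Colon definition at $200: code field = NEST, then tokens A B EXIT.
    m.mem[0x200] = Machine.nest
    m.mem[0x202], m.mem[0x204], m.mem[0x206] = 0x100, 0x110, 0x120
    # Top-level thread at $300: call the colon word, then A, then HALT.
    m.mem[0x300], m.mem[0x302], m.mem[0x304] = 0x200, 0x100, 0x130
    return m


def run(m, start):
    m.ip = start
    while m.running:     # each pass of the loop stands in for one _NEXT
        m.next()
    return m.trace
```

Calling run(build_demo(), 0x300) walks into the colon definition, runs A and B, returns through UNNEST and then runs A once more from the outer thread. The model leaves out the parameter stack entirely; it only shows how IP, W and R_PTR move.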

Code: Select all

DUP:    LDA  3,S    ;fetch hi word
        PHA         ;store hi word
        LDA  3,S    ;fetch lo word
        PHA         ;store lo word
       _NEXT

Code: Select all

PLUS:   CLC         ;
        PLA         ;
        ADC  3,S    ;compute lo result
        STA  3,S    ;save lo result
        PLA         ;
        ADC  3,S    ;compute hi result
        STA  3,S    ;save hi result
       _NEXT
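
The carry handling in PLUS can be checked on a desktop with a small Python model. It is a sketch only: the stack is a Python list of 16-bit words whose end is the top of stack, each cell is pushed hi word first so the lo word is pulled first, and the helper names push32, pop32 and plus32 are invented here.

```python
MASK16 = 0xFFFF


def push32(stack, value):
    """Push a 32-bit cell as two 16-bit words, hi word first."""
    stack.append((value >> 16) & MASK16)
    stack.append(value & MASK16)


def pop32(stack):
    """Pop a 32-bit cell: the lo word comes off first, then the hi word."""
    lo = stack.pop()
    hi = stack.pop()
    return (hi << 16) | lo


def plus32(stack):
    """Mirror of PLUS: add TOS into NOS, 16 bits at a time."""
    carry = 0                            # CLC
    lo = stack.pop()                     # PLA      lo word of TOS
    total = lo + stack[-2] + carry       # ADC 3,S  lo word of NOS
    stack[-2] = total & MASK16           # STA 3,S
    carry = total >> 16                  # carry out of the lo-word add
    hi = stack.pop()                     # PLA      hi word of TOS (carry kept)
    total = hi + stack[-2] + carry       # ADC 3,S  hi word of NOS
    stack[-2] = total & MASK16           # STA 3,S  final carry is discarded
```

The point the model makes is that the second pull must not disturb the carry from the first addition, which is why the pull/add/store order in the assembly works.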

Code: Select all

SWAP:   TSX         ;
        PHY         ;save Y (the Forth IP)
        LDA  1,X    ;fetch lo word of TOS
        LDY  5,X    ;fetch lo word of NOS
        STA  5,X    ;store to NOS
        STY  1,X    ;store to TOS
        LDA  3,X    ;fetch hi word of TOS
        LDY  7,X    ;fetch hi word of NOS
        STA  7,X    ;store to NOS
        STY  3,X    ;store to TOS
        PLY         ;restore Y
       _NEXT

dclxvi: Posts: 362; Joined: 11 Mar 2004

Post by dclxvi » Thu Feb 25, 2010 3:33 am

This strikes me as optimizing a special case. In Forth code I've seen (and written) FAR@ and friends are hardly ever used -- only in a very small percentage of applications, and then only once or twice in those applications. Doubling the width of a cell would result in a bulkier, slower Forth, since all words would use double-size cells, the overwhelming majority of which aren't FAR@ & co.

However, in applications that contain a mix of single and double size cells, it would be more convenient to have double size cells. (In those situations, I always seem to define words like 3DUP ( a b c - a b c a b c) or 2XOR (the XOR equivalent of D+).) Since this is the general case of the above, it's more common (or at least, less rare) of course. Garth once suggested using a separate stack for this sort of situation:

viewtopic.php?t=493

One thing I mentioned in the e-mail exchange with Garth was a concept I call a ghost stack (at the time, I hadn't come up with a name for it). The idea is that the data stack uses single width cells, as usual, but you have a separate stack that uses the same stack pointer. For example, if 1,S is the TOS, then say, $81,S would be the TOS of the ghost stack, so + might be:

Code: Select all

   PLA
   CLC
   ADC 1,S
   STA 1,S

then G+ would be:

Code: Select all

   PLA
   CLC
   ADC 1,S
   STA 1,S
   LDA $7F,S ; back up 2 since this is after the pull
   ADC $81,S
   STA $81,S

One of the big downsides is that the ghost stack will often contain uninitialized data (since single cell data doesn't write anything to the ghost stack), so if you have: ($xxxx represents uninitialized data here)

Code: Select all

 data  ghost

$ABCD  $xxxx  top of stack
$5678  $1234

and you thought you had:

Code: Select all

 data  ghost

$5678  $1234  top of stack
$ABCD  $xxxx

you could get intermittent behavior, and/or a bug that's very difficult to track down.
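
The hazard is easy to reproduce in a toy model. The Python below is a sketch under assumptions of its own: the data and ghost stacks are two parallel lists indexed by one shared pointer, stale ghost slots are pre-filled with $DEAD as a stand-in for uninitialized memory, and the class and method names are invented for illustration.

```python
class GhostStack:
    """Single-width data stack plus a 'ghost' stack sharing one pointer."""

    def __init__(self, depth=16, stale=0xDEAD):
        self.data = [0] * depth
        self.ghost = [stale] * depth   # stand-in for uninitialized memory
        self.sp = depth                # grows downward; sp indexes TOS

    def push(self, value):
        """Single-width push: the ghost slot is left untouched."""
        self.sp -= 1
        self.data[self.sp] = value & 0xFFFF

    def gpush(self, value):
        """Double-width push: lo word to data, hi word to ghost."""
        self.sp -= 1
        self.data[self.sp] = value & 0xFFFF
        self.ghost[self.sp] = (value >> 16) & 0xFFFF

    def gplus(self):
        """G+ : 32-bit add, lo halves in data, hi halves in ghost."""
        tos, nos = self.sp, self.sp + 1
        lo = self.data[tos] + self.data[nos]
        hi = self.ghost[tos] + self.ghost[nos] + (lo >> 16)
        self.data[nos] = lo & 0xFFFF
        self.ghost[nos] = hi & 0xFFFF
        self.sp += 1

    def gtop(self):
        """Read TOS as a double, trusting whatever the ghost slot holds."""
        return (self.ghost[self.sp] << 16) | self.data[self.sp]


good = GhostStack()
good.gpush(0x0001FFFF)
good.gpush(0x00000001)
good.gplus()                 # behaves like a normal 32-bit add

bad = GhostStack()
bad.gpush(0x12345678)
bad.push(0xABCD)             # single on top, its ghost slot is stale
wrong = bad.gtop()           # hi word comes from stale memory
```

In the second case the double that the programmer believed was on top is actually one slot down, and gtop happily pairs $ABCD with whatever was left in the ghost slot.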

I'm not entirely certain that a ghost stack is an ugly hack, but I'm pretty sure.

kc5tja: Posts: 1706; Joined: 04 Jan 2003

Post by kc5tja » Thu Feb 25, 2010 8:13 am

Using the ghost stack offers no advantages over using explicitly typed operators (e.g., M+, D+, 2XOR, 2OVER, etc.). In the ideal world, you really would have one stack per data type in your program.

If you find yourself routinely dealing with quantities larger than 16 bits, then a 32-bit Forth might be in order. Just be prepared for the performance hit.

In the Forth I wrote for my Kestrel environment, I used a 16-bit wide stack, but had dedicated accessors for different banks. @ and ! operated as you'd expect, but I also had F@ and F! for access to the video frame-buffer, which sat in another bank.

As I learn more and more about software engineering, I'm finding myself increasingly fond of using relational tables and block storage for such. To that end, if you're looking to access beyond the 64K, you certainly have a reason to do so, and odds are likely you're not touching only one datum that happens to span multiple banks!

So, one approach to working around this is to implement @ and ! for the home bank, and then (I'm just making up names here) B@ and B! for a "bank" fetch/store operation. A global variable BANK would be used to identify which bank B@ and B! operated on. Each bank, then, stores 64 blocks worth of data. Using B@ and B! eliminates the need to copy data around, but in all other ways, B@ and B! behave just like @ and !.
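
A minimal Python model of that idea might look like the following. It is a sketch, not anyone's actual implementation: the 1024-byte block size is the conventional Forth figure and is assumed here, as are the class name BankedMemory, the method names, and the first_bank parameter.

```python
BLOCK_BYTES = 1024       # conventional Forth block size (assumed here)
BANK_BYTES = 0x10000     # one '816 bank


class BankedMemory:
    """@ and ! use the home bank; B@ and B! use the bank in self.bank."""

    def __init__(self, home_bank=0):
        self.cells = {}            # (bank, address) -> 16-bit value
        self.home = home_bank
        self.bank = home_bank      # models the global BANK variable

    def fetch(self, addr):                     # @
        return self.cells.get((self.home, addr & 0xFFFF), 0)

    def store(self, value, addr):              # !
        self.cells[(self.home, addr & 0xFFFF)] = value & 0xFFFF

    def bfetch(self, addr):                    # B@
        return self.cells.get((self.bank, addr & 0xFFFF), 0)

    def bstore(self, value, addr):             # B!
        self.cells[(self.bank, addr & 0xFFFF)] = value & 0xFFFF


def block_location(block_number, first_bank=1):
    """Map a block number to (bank, offset) with 64 blocks per bank."""
    bank, offset = divmod(block_number * BLOCK_BYTES, BANK_BYTES)
    return first_bank + bank, offset
```

With 1K blocks, BANK_BYTES // BLOCK_BYTES comes to the 64 blocks per bank mentioned above, and addresses on the stack stay 16 bits wide because the bank lives in a variable rather than in the cell.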

Anyway, that's just an idea. Still, the 32-bit wide stack isn't a problem if you don't particularly care about run-time performance. And, realistically speaking, you should profile your code and convert hot-spots to 65816 primitives anyway.

cr1901: Posts: 158; Joined: 05 Feb 2014

Post by cr1901 » Fri Dec 05, 2014 10:10 pm

I feel bad (partially) for reviving a 5 year old thread, but I'm mulling over this exact topic right now.

From what I understand, Forth requires the implementation to access data using addresses which are the same size as the basic cell. This is fine for most processors, but for banked/segmented archs (namely, 65816, 8088, and 286), and more generally archs where address bus width != data bus width, this falls apart.

All solutions provided here sound appealing- I think I'd rather take the code complexity hit and keep performance up by using a natural-width cell for the '816. I don't see the '816 gracefully handling 64-bit data (but who knows- didn't the 8088 have to when paired with an FPU?).

Ideally, at least for data, I'd like to abstract away the banks completely, so cross-bank data is handled correctly (using a MVN/MVP instruction). Dr. Jeffyl's idea of having far versions of !/@ sounds closest to what I want.

Not sure if I can abstract away cross-bank jumps in an '816 Forth (meaning if I wanted to use more than one '816 bank, I'd be forced to manually code it into my Forth programs, adding an environmental dependency), but if anyone has any ideas, I'm all ears!

EDIT: Also, A and X/Y size switching. I suspect there's performance and space reasons to do so during character accesses and large sections of code that don't require 16-bit A X/Y.

BigDumbDinosaur: Posts: 9428; Joined: 28 May 2009; Location: Midwestern USA (JB Pritzker’s dystopia)

Post by BigDumbDinosaur » Fri Dec 05, 2014 11:52 pm

First the disclaimer: IANAFE (I Am Not A Forth Expert). However, I do know a bit about using the 65C816 to manipulate data.

Quote:

From what I understand, Forth requires the implementation to access data using addresses which are the same size as the basic cell. This is fine for most processors, but for banked/segmented archs (namely, 65816, 8088, and 286), and more generally archs where address bus width != data bus width, this falls apart.

It would seem that the best way to handle this requirement would be to internally pad addresses to a constant size of 32 bits, even though the '816 is limited to 24 bit addressing. I say this because 24 bit processing with the '816 is cumbersome, whereas 32 bit processing is relatively simple to implement.

Quote:

All solutions provided here sound appealing- I think I'd rather take the code complexity hit and keep performance up by using a natural-width cell for the '816. I don't see the '816 gracefully handling 64-bit data (but who knows- didn't the 8088 have to when paired with an FPU?).

The '816 handles 64 bit data reasonably well—I use 64 bit arithmetic in my integer math routines with little effort.

Quote:

Ideally, at least for data, I'd like to abstract away the banks completely, so cross-bank data is handled correctly (using a MVN/MVP instruction). Dr. Jeffyl's idea of having far versions of !/@ sounds closest to what I want.

MVN and MVP make sense if a large amount of data is to be moved: the copy rate of 7 cycles per byte is indeed rapid when compared to a conventional load/store loop. However, that performance advantage disappears when copying small blocks of data, as the setup code needed to use MVN and MVP has to be executed with every use, and becomes a significant part of the total execution time of the copy operation. Also, MVN and MVP modify DB, which means it may be necessary to preserve DB prior to initiating the copy. Another factor to consider is that the operands to MVN and MVP specify the target and source banks (respectively), hence are part of the executable code. You have to overwrite those operands with your banks as part of the setup, which means you are implementing self-modifying code.
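
The trade-off can be put into rough numbers with a few lines of Python. Only the 7 cycles per byte comes from the text above; the setup cost and the per-byte cost of the ordinary loop are placeholders that would have to be counted from real code, and the function names are invented for this sketch.

```python
MVN_CYCLES_PER_BYTE = 7    # figure quoted in the post


def mvn_cycles(n_bytes, setup_cycles):
    """Total cost of a block move: fixed setup plus 7 cycles per byte."""
    return setup_cycles + MVN_CYCLES_PER_BYTE * n_bytes


def loop_cycles(n_bytes, cycles_per_byte):
    """Total cost of a plain load/store loop with no fixed setup."""
    return cycles_per_byte * n_bytes


def break_even(setup_cycles, cycles_per_byte):
    """Smallest byte count at which the block move is strictly cheaper.

    Returns None if the loop is never slower per byte than the block move.
    """
    if cycles_per_byte <= MVN_CYCLES_PER_BYTE:
        return None
    n = 1
    while mvn_cycles(n, setup_cycles) >= loop_cycles(n, cycles_per_byte):
        n += 1
    return n
```

For instance, with a hypothetical 20 cycles of setup and a hypothetical 12 cycles per byte for the loop, the block move only starts to win at 5 bytes, which is roughly the size of one or two stack cells.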

Alternatives to using MVN and MVP are the [<dp>] and [<dp>],Y addressing modes, which efface bank boundaries. The segmented architecture of the '816 is essentially a limitation on programs (PB is not incremented when PC is wrapped), not data structures, assuming use of proper programming methods. These addressing modes eliminate having to tinker with DB, since the address at <dp> includes A16-A23.
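
As an illustration of how [<dp>],Y ignores bank boundaries, the address arithmetic can be modelled in Python. This is a sketch with invented names (long_indirect_indexed, copy_bytes); direct page and main memory are plain dicts of bytes.

```python
def long_indirect_indexed(direct_page, dp, y):
    """Effective address of [dp],Y: 24-bit pointer at dp, plus Y."""
    pointer = (direct_page[dp]
               | (direct_page[dp + 1] << 8)
               | (direct_page[dp + 2] << 16))
    # The add carries into the bank byte; the bus is 24 bits wide.
    return (pointer + (y & 0xFFFF)) & 0xFFFFFF


def copy_bytes(memory, direct_page, src_dp, dst_dp, count):
    """Byte-at-a-time copy through two long pointers, indexed by Y."""
    for y in range(count):
        src = long_indirect_indexed(direct_page, src_dp, y)
        dst = long_indirect_indexed(direct_page, dst_dp, y)
        memory[dst] = memory.get(src, 0)


# Source pointer $01FFFE at dp $10, destination pointer $030000 at dp $20.
direct_page = {0x10: 0xFE, 0x11: 0xFF, 0x12: 0x01,
               0x20: 0x00, 0x21: 0x00, 0x22: 0x03}
memory = {0x01FFFE: 1, 0x01FFFF: 2, 0x020000: 3, 0x020001: 4}
copy_bytes(memory, direct_page, 0x10, 0x20, 4)
```

The source buffer straddles the bank $01/$02 boundary, yet the loop never touches a bank register: the pointer in direct page carries A16-A23 and the index simply adds into it.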

Quote:

Not sure if I can abstract away cross-bank jumps in an '816 Forth (meaning if I wanted to use more than one '816 bank, I'd be forced to manually code it into my Forth programs, adding an environmental dependency), but if anyone has any ideas, I'm all ears!

It would make sense for the Forth kernel to occupy the first non-zero bank (i.e., bank $01), leaving plenty of space in bank $00 for direct pages and stacks. That arrangement would leave plenty of room above the kernel for dictionaries, transient data, etc., as RAM would be contiguous from $010000 to the top of installed memory.

Quote:

EDIT: Also, A and X/Y size switching. I suspect there's performance and space reasons to do so during character accesses and large sections of code that don't require 16-bit A X/Y.

I've noted a tendency for 6502 programmers who are new to the '816 to stick with what they know, rather than embrace the "'816 philosophy." In much of my code I leave the registers set to 16 bits and only go to eight bits when a specific function requires byte-at-a-time processing, e.g., display I/O. Specifically, there is no performance penalty in implied register operations when they are set to 16 bits, as all ALU operations in the '816 are 16 bit operations. That is to say, INX always consumes two clock cycles, whether .X and .Y are set to 8 bits or 16 bits.

x86? We ain't got no x86. We don't NEED no stinking x86!

Dr Jefyll: Posts: 3526; Joined: 11 Dec 2009; Location: Ontario, Canada

Post by Dr Jefyll » Sat Dec 06, 2014 12:58 am

cr1901 wrote:

I feel bad (partially) for reviving a 5 year old thread, but I'm mulling over this exact topic right now.

No apology needed! What you did made sense if you're mulling over this exact topic.

In the lead post I suggested a 32 bit Forth Virtual Machine implemented using 65c816. The premium is on fast, easy programming, with execution speed explicitly assigned secondary priority. The reason I want a 32-bit VM is so my code doesn't suddenly become a PITA every time I need an address or data word exceeding 16 bits. How good could such a VM be, and how far can the performance loss be minimized? (A 32-bit VM is not the only option; there are other remedies, and no one answer is best for all scenarios.)

cr1901 wrote:

Dr. Jeffyl's idea of having far versions of !/@ sounds closest to what I want.

Having FAR versions did work. True, it's a minor nuisance having a bilingual word set (FAR- and non-FAR versions of otherwise identical words). More important, though: what I regret is not declaring that stack space should always be allocated & deallocated 32 bits at a time (even though 2 of the allocated bytes may go unused). One option is to use a ghost stack for the MS word, which reduces the amount of DEXing and INXing you do as the stack grows and shrinks.

cr1901 wrote:

[...] addresses which are the same size as the basic cell [...] but for banked/segmented archs (namely, 65816, 8088, and 286), and more generally archs where address bus width != data bus width, this falls apart.

Maybe it just means you need a bigger cell size. That's the suggested premise -- solve the problem in the most straight-forward way.

There's one compromise I find tolerable, since its impact is low but its performance benefit is large. A 64K limit on the dictionary drastically speeds up token fetching and NEXT. The 16-bit-ness of tokens is largely a matter for the compiler, and has no bearing on the majority of user-written code.

@BDD; just saw your post: I see we agree on the 32-bit padding thing. Good comment, too, re the (in-)efficiency of using MVP and MVN. But I'm not sure what cr1901 meant by saying, "so cross-bank data is handled correctly (using a MVN/MVP instruction)." I hope it doesn't imply copying data from FAR space to normal space as a prelude to acting on the data. The data should be acted on in situ. (The space should be big enough that you can just leave the data where it is.) Maybe I misunderstood the remark.

cheers,
Jeff

In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html

GARTHWILSON: Forum Moderator; Posts: 8775; Joined: 30 Aug 2002; Location: Southern California

Post by GARTHWILSON » Sat Dec 06, 2014 1:23 am

cr1901 wrote:

I feel bad

Don't.

Quote:

From what I understand, Forth requires the implementation to access data using addresses which are the same size as the basic cell.

A cell normally needs to be at least as many bits as you use in addresses, so any relevant address will fit in a single cell.

Quote:

EDIT: Also, A and X/Y size switching. I suspect there's performance and space reasons to do so during character accesses and large sections of code that don't require 16-bit A X/Y.

REP and SEP (which, IMO, should normally be used in macros with descriptive names, like ACCUM_16 or INDEX_8, as REP and SEP themselves are so painfully cryptic) take 3 clocks each-- not a lot. For my '816 Forth however, I mostly leave A in 16-bit and X & Y in 8-bit, and there are very few places where they're changed from that.
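
What those macros do to the status register can be shown with a short Python model. ACCUM_16 and INDEX_8 are the names from the post; accum_8 and index_16 are assumed by analogy, and rep/sep here are plain functions on a byte, not real instructions.

```python
M_FLAG = 0x20   # accumulator/memory width: 1 = 8-bit, 0 = 16-bit
X_FLAG = 0x10   # index register width:     1 = 8-bit, 0 = 16-bit


def rep(p, mask):
    """REP #mask: clear the selected bits of the status register."""
    return p & ~mask & 0xFF


def sep(p, mask):
    """SEP #mask: set the selected bits of the status register."""
    return (p | mask) & 0xFF


def accum_16(p):
    return rep(p, M_FLAG)      # REP #$20


def accum_8(p):
    return sep(p, M_FLAG)      # SEP #$20


def index_16(p):
    return rep(p, X_FLAG)      # REP #$10


def index_8(p):
    return sep(p, X_FLAG)      # SEP #$10


# The arrangement described above: A in 16-bit, X and Y in 8-bit.
p = index_8(accum_16(0x30))
```

Wrapping the masks in named helpers is the same readability argument made for the assembler macros: the call site says which register changes width instead of showing a bare $20 or $10.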

Quote:

It would seem that the best way to handle this requirement would be to internally pad addresses to a constant size of 32 bits, even though the '816 is limited to 24-bit addressing. I say this because 24-bit processing with the '816 is cumbersome, whereas 32-bit processing is relatively simple to implement.

I would concur.

http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

cr1901: Posts: 158; Joined: 05 Feb 2014

Post by cr1901 » Sat Dec 06, 2014 1:26 am

Dr Jefyll wrote:

More important, though: what I regret is not declaring that stack space should always be allocated & deallocated 32 bits at a time (even though 2 of the allocated bytes may go unused).

Atm, I only have time to ask about this/will parse your posts entirely later, but...

I know Forth is a stack machine and ideally, you operate only on the top elements of the stack, but wouldn't allocating 2-cells worth of data for each stack push screw up programs that expect to be able to increment by the "cell size" manually and find the next data item on the stack? It sounds like if you took char pointers in C and made pointer increment add two to the character address instead of one. I don't think a number of programs would like that.

Dr Jefyll: Posts: 3526; Joined: 11 Dec 2009; Location: Ontario, Canada

Post by Dr Jefyll » Sat Dec 06, 2014 2:01 am

cr1901 wrote:

wouldn't allocating 2-cells worth of data for each stack push screw up programs that expect to be able to increment by the "cell size" manually and find the next data item on the stack?

Careful not to confuse two different things -- namely, Forth source code ("high level") written to run on the VM, and assembly-language (low level) 65xx code that runs "under the hood" to provide the illusion of the VM. When you mention, "increment by the cell size manually and find the next data item on the stack," that is under-the-hood stuff which indeed needs to be dealt with when the VM implementation is being written (in assembler). But that's of less concern, since you only write the implementation once -- or maybe someone else did it. True, you may occasionally choose to write a new word of your own in assembler but it's optional.

The long-term payoff is when you're writing high level code to run on the VM. This happens many times, not once! And you enjoy the benefit of being able to write clean, easy code that hides the 65xx's actual workings from you. Your Forth source code only has to be written in terms which the VM's interpreter understands. A simple VM makes for productive high-level Forth coding. In fact it's possible that same source code might run unchanged on a 16-bit VM, provided that 16 bits is adequate for the application and the VMs are sufficiently similar. (That'd have to be checked, unless both adhere to one of the Forth standards.) Transporting high-level code the other way (16-bit platform to 32-bit) is also possible.

ETA-

Quote:

It sounds like if you took char pointers in C and made pointer increment add two to the character address instead of one.

Your high-level code would go 1 cells + or similar. On a 16-bit VM cells executes as a multiplication by two. On a 32-bit VM it executes as a multiplication by four. So, your high-level code is unaffected by what lies underneath -- you don't worry about it.
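
The same point in a few lines of Python, as a sketch only: make_cells and nth_cell_address are invented names, and the parameter is the number of address units one cell occupies on the VM in question.

```python
def make_cells(address_units_per_cell):
    """Build the CELLS word for a VM with the given cell size."""
    def cells(n):
        return n * address_units_per_cell
    return cells


cells16 = make_cells(2)   # 16-bit VM on a byte-addressed machine
cells32 = make_cells(4)   # 32-bit VM on a byte-addressed machine


def nth_cell_address(base, n, cells):
    """What 'base n cells +' computes in high-level Forth."""
    return base + cells(n)
```

The high-level phrase is identical on both VMs; only the definition of cells underneath differs, so nth_cell_address($1000, 3, ...) lands on $1006 with 16-bit cells and $100C with 32-bit cells.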

barrym95838: Posts: 2056; Joined: 30 Jun 2013; Location: Sacramento, CA, USA

Post by barrym95838 » Sat Dec 06, 2014 2:44 am

Dr Jefyll wrote:

... Your high-level code would go 1 cells + or similar. On a 16-bit VM cells executes as a multiplication by two. On a 32-bit VM it executes as a multiplication by four. So, your high-level code is unaffected by what lies underneath -- you don't worry about it

Code: Select all

                 ; 1035 ;--------------------------------------------------------- CELLS
00000318:00000313; 1036         NOT_IMM 'CELLS'
00000319:0543454c; 1036 
0000031a:4c530000; 1036 
                 ; 1037 _cells: ; ( n -- n' )  Cells->address units
0000031b:0000031c; 1038         PRIMITIVE
0000031c:5c060053; 1039         jmp  next_

Mike

Dr Jefyll: Posts: 3526; Joined: 11 Dec 2009; Location: Ontario, Canada

Post by Dr Jefyll » Sat Dec 06, 2014 3:00 am

Oops -- thanks, Mike. You've reminded me to mention that my "times 2" and "times 4" comments apply when the underlying machine addresses bytes -- unlike yours, obviously! And of course byte addressing, though sometimes handy, is hardly a necessity.

Since it does nothing (multiplies "times 1"), your code for cells has only one optimization possible: when cells appears in the source code, have it do nothing at compile time rather than at run time. IOW don't even compile a jump to NEXT.

J

BigDumbDinosaur: Posts: 9428; Joined: 28 May 2009; Location: Midwestern USA (JB Pritzker’s dystopia)

Post by BigDumbDinosaur » Sat Dec 06, 2014 6:28 am

Dr Jefyll wrote:

cr1901 wrote:

Dr. Jeffyl's idea of having far versions of !/@ sounds closest to what I want.

Having FAR versions did work. True, it's a minor nuisance having a bilingual word set (FAR- and non-FAR versions of otherwise identical words). More important, though: what I regret is not declaring that stack space should always be allocated & deallocated 32 bits at a time (even though 2 of the allocated bytes may go unused). One option is to use a ghost stack for the MS word, which reduces the amount of DEXing and INXing you do as the stack grows and shrinks.

Again, IANAFE. I think that a consideration in this discussion is the relative importance of memory usage versus flexibility. Hypothetically, a 65C816 system can be built with 16MB of RAM, with essentially all of it flat address space for data. A program, of course, cannot "officially" span banks, so that means a 64KB limit. In a Forth system, the "program" is the Forth kernel, which from the little bit of investigation I've done, seems to handily fit in a single bank with plenty of elbow room. All of this has some interesting implications that tie into raw performance.

24 bit addressing always requires extra clock cycles. So it would make sense to give the kernel a dedicated bank for its internal data tables/structures and workspace and normally leave DB pointing at that bank. Some memory will go to waste, as (again IANAFE) that usage appears to be modest. Here we are trading resource efficiency for performance, as all reads and writes on RAM can be expressed with 16 bit addresses.

The word dictionary could be started in the next bank and allowed to span as many banks as needed to hold the dictionary and any related data structures (pointer arrays, etc.). Any addresses related to words in the dictionary would be expressed as 32 bit little-endian values, which are easily manipulated by the '816. As data structures are not boxed in by bank boundaries, the Forth kernel can write its data almost anywhere.

Aside from the requirement that direct page and the MPU stack be in bank $00, the other key architectural consideration is the requirement that interrupt handlers also be in bank $00. This doesn't mean that an entire interrupt handler has to be in bank $00—a long jump into the Forth kernel's bank could be made as the first instruction of the interrupt handler.

It would seem that since the addressable memory range of an '816 system can greatly vary, the Forth kernel would need to be able to determine how much "extended" RAM is available. Presumably, a system would have at least 512KB, as that is the largest 5 volt SRAM that appears to be currently available. A machine built around one of Garth's DIMMs would have 4 MB. I daresay from Forth's perspective that is a vast amount of core.

Quote:

cr1901 wrote:

[...] addresses which are the same size as the basic cell [...] but for banked/segmented archs (namely, 65816, 8088, and 286), and more generally archs where address bus width != data bus width, this falls apart.

Maybe it just means you need a bigger cell size. That's the suggested premise -- solve the problem in the most straight-forward way.

There's one compromise I find tolerable, since its impact is low but its performance benefit is large. A 64K limit on the dictionary drastically speeds up token fetching and NEXT. The 16-bit-ness of tokens is largely a matter for the compiler, and has no bearing on the majority of user-written code.

Yes, a 64KB limit on the dictionary would produce better performance, since 24 bit addressing and the associated clock cycle consumption would be avoided. However, that would seem to imply that the dictionary's location would have to be known prior to run-time—that is, statically assembled into the kernel. Of course, I'm speaking from a certain level of naivety here—I don't really know enough about how the dictionary is internally structured to be clear on the implications of constraining the dictionary to 64KB. Would there be any advantage to allowing the dictionary to span into the following bank(s)? What would be the disadvantages to doing so, aside from the need to address the dictionary with 24 bit instructions?

Quote:

@BDD; just saw your post: I see we agree on the 32-bit padding thing. Good comment, too, re the (in-)efficiency of using MVP and MVN. But I'm not sure what cr1901 meant by saying, "so cross-bank data is handled correctly (using a MVN/MVP instruction)." I hope it doesn't imply copying data from FAR space to normal space as a prelude to acting on the data. The data should be acted on in situ. (The space should be big enough that you can just leave the data where it is.) Maybe I misunderstood the remark.

I'm also not entirely certain what cr1901 meant. I'm with you on the premise that in situ operations are the way to go. I see no profit in shuffling data from one place to another just to perform an operation, excepting maybe floating-point arithmetic, where it is often convenient to have dedicated accumulators (usually) in direct page.

GARTHWILSON: Forum Moderator; Posts: 8775; Joined: 30 Aug 2002; Location: Southern California

Post by GARTHWILSON » Sat Dec 06, 2014 9:58 am

Quote:

Yes, a 64KB limit on the dictionary would produce better performance, since 24 bit addressing and the associated clock cycle consumption would be avoided. However, that would seem to imply that the dictionary's location would have to be known prior to run-time—that is, statically assembled into the kernel. Of course, I'm speaking from a certain level of naivety here—I don't really know enough about how the dictionary is internally structured to be clear on the implications of constraining the dictionary to 64KB. Would there be any advantage to allowing the dictionary to span into the following bank(s)? What would be the disadvantages to doing so, aside from the need to address the dictionary with 24 bit instructions?

Hopefully this clears things up more than it further muddies the waters.

The "dictionary" basically keeps the definitions of all the words used in the language-- not just the ones that came in the kernel, but the ones the user added later to build his application, too. It is all the executable code; and assuming you compile with headers, there are the names to look up the definitions of. FIND looks up a name in the dictionary and returns its address, whether to compile or to execute, as appropriate at the time. (Programs are always compiled, but you can have it interpret a command line without compiling; and some words have compile-time behavior that is separate from their run-time behavior. It really works out to be extremely flexible.)
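
For readers who have not looked inside a Forth, the header search can be sketched in Python. This is a generic linked-list model, not the layout of any particular Forth: the Header and Dictionary classes, the create method, the case-insensitive match and the example addresses are all assumptions made for illustration.

```python
class Header:
    """One dictionary header: name, execution address, link to older word."""

    def __init__(self, name, xt, link, immediate=False):
        self.name = name
        self.xt = xt                  # address of the word's executable part
        self.link = link              # previous (older) header, or None
        self.immediate = immediate    # True if the word runs at compile time


class Dictionary:
    def __init__(self):
        self.latest = None            # most recently defined word

    def create(self, name, xt, immediate=False):
        self.latest = Header(name.upper(), xt, self.latest, immediate)
        return self.latest

    def find(self, name):
        """Search newest-first, so a redefinition hides the older word."""
        wanted = name.upper()
        header = self.latest
        while header is not None:
            if header.name == wanted:
                return header
            header = header.link
        return None


words = Dictionary()
words.create("DUP", 0x0100)
words.create("SWAP", 0x0120)
words.create("IF", 0x0140, immediate=True)
words.create("DUP", 0x0200)           # user redefinition of DUP
```

Here words.find("dup") returns the newer header at $0200, and the immediate flag on IF is the kind of marker that lets the outer interpreter decide whether to compile a word or execute it on the spot.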

Some words of course will need to be defined in assembly. These are called "primitives" or "code definitions." Others can be defined in terms of other Forth words, including primitives. These are called "secondaries" or "colon definitions." The user's code can often get by without defining any more primitives, but the kernel will definitely need many words defined as primitives. With my assembler built in and easy to use, I don't have any resistance to writing primitives as part of an application if appropriate. I'm not interested in portability. With the excuse of portability (but apparently more likely just for the sport of it), there have been countless topics in the magazines and forums in past decades regarding how few primitives you can get by with. I don't think I've seen fewer than about 30; but the performance is dismal when you go that low. Imagine doing a multiply word defined as a secondary in terms of a severely limited group of primitives. I have hundreds of primitives in mine.
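
To make the distinction concrete, here is a minimal sketch of what a primitive might look like in a 32-bit '816 Forth. It is an illustration only, not code from any kernel mentioned in this topic. It assumes the data stack lives in direct page with X as the stack pointer, that cells are 32 bits stored low word first, and that fnext is a hypothetical label for the entry to the inner interpreter. It increments the top-of-stack cell in place, which is the job of the standard word 1+ .

Code: Select all

oneplus  rep #%00100000        ;16 bit .A & memory
         inc 0,x               ;increment LSW of top-of-stack cell (assumed layout)
         bne oneplus1          ;no carry into MSW
;
         inc 2,x               ;carry: increment MSW
;
oneplus1 jmp fnext             ;fnext is hypothetical: back to the inner interpreter

As a secondary, the same operation would be a colon definition along the lines of : 1+ 1 + ; which works, but pays for nesting, a literal fetch and a general-purpose add every time it runs.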

Continuing in the theme of a dictionary and a language, most Forths allow having different vocabularies too, so a word in one vocabulary may not have much resemblance to the same name in a different vocabulary. In the assembler vocabulary for example, AND will lay down the appropriate AND op code and the operand, whereas in the Forth vocabulary, AND takes the bitwise AND of the two top stack cells and returns the result. (I avoided that by using AND#, AND_ZP, AND_abs, etc. in my assembler.) Or in an analogy, think of a ball meaning one thing in the context of baseball, another thing in the context of football, and another thing in the context of a dance. (And, not surprisingly, VOCABULARY and CONTEXT are Forth words.)

Anyway, everything in Forth is a word, whether it's to return a constant, give the address of a variable or array, produce an arithmetic action, compile something, output something, etc., all the way up to your top definitions which basically are your programs. All the words you write go into the dictionary and become part of the language. If you want to be able to find the words again later for further compilation, you'll want headers. Although the headers don't take up as much room as the code, it's not a negligible amount of space either; so I've seen early PC Forths that put the headers in one 64K space, code in another, and data in another. I have not tried all the different possible arrangements myself, so I can't say much about the pros and cons of the different memory distributions. My own intention, when I get back to working on mine, is to keep all code and headers in bank 0, and use other banks for data only. For my uses, one bank for code, in Forth, is a lot of space.

http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

Martin_H: Posts: 837; Joined: 08 Jan 2014

Re: 32-bit Forth on the 65c816

Post by Martin_H » Sat Dec 06, 2014 12:24 pm

Given the typeless nature of the Forth stack, and the ability to push either the address or contents of a cell, wouldn't a 24 bit cell size be the choice for a 65816?

BigDumbDinosaur: Posts: 9428; Joined: 28 May 2009; Location: Midwestern USA (JB Pritzker’s dystopia); Contact:

Re: 32-bit Forth on the 65c816

Post by BigDumbDinosaur » Sat Dec 06, 2014 7:49 pm

Martin_H wrote:

Given the typeless nature of the Forth stack, and the ability to push either the address or contents of a cell, wouldn't a 24 bit cell size be the choice for a 65816?

Seemingly. However, manipulating 24 bit values with the '816 is awkward and inefficient in many cases. My experience is it's best to work with data sizes that are even multiples of 8 bits: 8, 16, 32, etc. This ties in naturally with the use of 16 bit loads and stores, and especially simplifies the use of R-M-W instructions, which are often called upon to maintain counters.

For example, to increment a 32 bit value, one could write:

Code: Select all

         rep #%00100000        ;16 bit .A & memory
         inc somewhere         ;increment LSW
         bne next
;
         inc somewhere+2       ;increment MSW
;
next     ...program continues...

Now suppose the value stored at somewhere is 24 bits. You'd have to write:

Code: Select all

         rep #%00100000        ;16 bit .A & memory
         inc somewhere         ;increment LSW
         bne next
;
         sep #%00100000        ;8 bit .A & memory
         inc somewhere+2       ;increment MSB
;
next     ...program continues...

Yes, you save a byte of storage in the second case, but you add two bytes to the code, along with three clock cycles each time the SEP instruction has to be executed.
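
One side effect of the 24 bit version above is worth spelling out. When the branch is not taken, execution reaches next with the accumulator still 8 bits wide, whereas the branch-taken path arrives with it 16 bits wide. If the code at next relies on a 16 bit accumulator, a REP has to follow the INC, as in the sketch below (same somewhere label as above). SEP and REP are each two bytes and three cycles, so together they add four bytes of code and six clock cycles of mode switching on the carry path.

Code: Select all

         rep #%00100000        ;16 bit .A & memory
         inc somewhere         ;increment LSW
         bne next
;
         sep #%00100000        ;8 bit .A & memory
         inc somewhere+2       ;increment MSB
         rep #%00100000        ;restore 16 bit .A & memory
;
next                           ;program continues with 16 bit .A on both paths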

Now consider the case of decrementing a 32 bit value:

Code: Select all

         rep #%00100000        ;16 bit .A & memory
         lda somewhere         ;test LSW
         bne next              ;ignore MSW
;
         dec somewhere+2       ;decrement MSW
;
next     dec somewhere         ;decrement LSW

If the value stored at somewhere is 24 bits, you would have to write:

Code: Select all

         rep #%00100000        ;16 bit .A & memory
         lda somewhere         ;test LSW
         bne next              ;ignore MSB
;
         sep #%00100000        ;8 bit .A & memory
         dec somewhere+2       ;decrement MSB
         rep #%00100000        ;16 bit .A & memory
;
next     dec somewhere         ;decrement LSW

Sixteen bit loads and stores on memory take only one more clock cycle than eight bit loads and stores. Ergo there is a potentially large performance gain to be had by arranging your program to handle data as even multiples of eight bits.
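
To put rough numbers on that, here is a sketch that copies a 32 bit value both ways. The label elsewhere is made up for the example. The cycle counts are the documented ones for absolute addressing: 4 cycles for an 8 bit LDA or STA, 5 for the 16 bit forms.

Code: Select all

;copy 32 bits with 8 bit loads & stores: 4 * (4 + 4) = 32 cycles
;
         sep #%00100000        ;8 bit .A & memory
         lda somewhere         ;4 cycles
         sta elsewhere         ;4 cycles
         lda somewhere+1
         sta elsewhere+1
         lda somewhere+2
         sta elsewhere+2
         lda somewhere+3
         sta elsewhere+3
;
;copy 32 bits with 16 bit loads & stores: 2 * (5 + 5) = 20 cycles
;
         rep #%00100000        ;16 bit .A & memory
         lda somewhere         ;5 cycles, moves two bytes
         sta elsewhere         ;5 cycles
         lda somewhere+2
         sta elsewhere+2

Neither total includes the SEP or REP itself. The 16 bit version also needs half as many instructions, so it is half the size as well.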

Last edited by BigDumbDinosaur on Sun Dec 07, 2014 8:22 am, edited 1 time in total.
