6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sun Nov 10, 2024 9:31 pm

All times are UTC




Post new topic Reply to topic  [ 28 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Wed Apr 02, 2014 1:22 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Hello, Forthers

I am trying to learn Forth by following the eccentric path of hand-translating an ITC Forth to my new processor design. I have learned tons of fascinating details, but I had a question regarding the Header structure.

If I'm not mistaken, it is customary to have it formed as a macro which includes previous link, attribute flags, name length, name, CFA, in that order, followed immediately by the code field, if applicable. The immediate flag (and one or two other attribute bits) can be squeezed into the upper bits of the name length, making it 'non-standard' in the sense of a standard string. That's where my question comes in. Is there such a thing as a 'standard' format for strings in Forth?

The reason that I'm asking is that my 65m32 doesn't address bytes, only 32-bit words, and I don't want to waste 75% of my available bits on strings by following a byte-oriented format. I would like to use a packed ascii format that holds up to four 7-bit chars per word, sets bit 31 if there are any more chars following, and signals the last word of the packed string by clearing bit 31 for just that last word. I am writing (FIND) right now, and although I realize that I don't have to follow any standard for my own amusement, if would be nice to be able to offer a finished product that doesn't irritate and confuse potential users.

Does anyone have an opinion on this method of encoding strings, and is there any 'prior art' along these lines with which you are familiar? If I use this method for finding names, should I just leave it at that, or should I write some other words to deal with packed strings, or should I modify the existing string words to accept packed formats as well as the 'standard' one-char-per-word. I know that I could just ignore NULs and treat unpacked, loosely packed and tightly packed with the same routine, but that doesn't address the length-byte issue.

Thanks in advance for any comments, suggestions or criticisms.

Mike


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 1:35 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
A follow-up:

It occurs to me that some of you (Garth) like to use 8-bit ascii a bit, and that would make my scheme almost useless, so maybe I should consider keeping the leading length word, but still allow up to four chars in each following word.

Mike

P.S. If this topic is not appropriate for a 6502 forum, I apologize, and invite comments to be directed to a similar thread at anycpu.org


Last edited by barrym95838 on Wed Apr 02, 2014 1:57 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 1:55 am 
Offline
User avatar

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Mike:

I would pack the characters 4 8-bit per word. I would use a NULL byte to mark the end of a string like in C.

I might consider implementing a instruction to right shift 8 bits at a time that zero fills. This instruction would support a little endian view of the byte order of a string. The first byte would be in the least significant 8 bits, and so on to the most significant 8 bits. If after three right shifts, the value of the word is 0, then it is the end of the string. Otherwise, process the fourth character of the word and move to the next word. At most, you'll waste three bytes to mark the end of a four byte string. But without a byte addressable memory architecture you've already made that trade and decided that it is acceptable. (You can also implement a left shift by 8 instruction and consider the string data to be big endian instead of little endian.)

The word addressed Data General Nova provided a mechanism like I've described to process 8 bit bytes. Since I've written any code for that architecture, I don't know how they determined the length of strings. Another approach is to use the Pascal approach, and use the first byte as the length of the string. In either case, you will run into situations where you'll have some wasted space in words whenever the length is not 3, 7, 11, 15, etc.

I suppose my recommendation is not to use 7-bit packing. It really does not buy you very much in the way of savings per word, and I expect it to complicate whatever assemblers or compilers you may develop for you processor.

_________________
Michael A.


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 2:00 am 
Offline

Joined: Sat Aug 21, 2010 7:52 am
Posts: 231
Location: Arlington VA
C tends to use null-terminated strings. Forth strings (including dictionary words) are traditionally stored with a count byte at the beginning followed by that many characters.


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 2:21 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Thanks, Michael.

The concept of 'endian-ness' is a bit different when all of the ordered elements are in a single cell, but I get your point. I do have 64-bit logical shift instructions (carry is no longer a flag, but instead an entire 32-bit register) that can shift by any reasonable hard-coded count, so it would be less confusing to encode the packed cell with the first char in the high bits. A left-shift by eight would place the next char in the carry register, where it could be processed further, without any wasted additional shifts, and a zero result in the accumulator would indicate that the word contains no more significant chars.

Regarding Pascal vs. C style strings, am I correct in assuming that Forth tends to favor the Pascal style? If I encode the length inside the first char slot of the first word, I'm limited to 256 chars (or should it be words?!?!). If I place the count in a separate pre-fix word, then I'm limited to 2^32-1 chars (or words). Have you encountered any annoyances dealing with strings limited to a rather short length?

Mike


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 2:32 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
chitselb wrote:
C tends to use null-terminated strings. Forth strings (including dictionary words) are traditionally stored with a count byte at the beginning followed by that many characters.


Thanks for confirming that. Are the dictionary names encoded the same way as Forth strings for a good reason, or just for the sake of consistency?

If I follow the path I have begun, there will be no difference between my @ and C@ , ! and C! , + and D+ , , and C, , etc ... Essentially everything will be 32-bits wide, with the possible exception of (optionally) packed strings. Do any of you find this to be particularly annoying, and would it be a significant portability issue for existing code? Should I leave the possibility open for a 64-bit D+ and relatives?

Mike


Last edited by barrym95838 on Wed Apr 02, 2014 2:35 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 2:35 am 
Offline

Joined: Tue Jan 07, 2014 8:40 am
Posts: 91
chitselb is correct. The usual way that Forth stores strings is a count byte followed by 1-255 characters. I believe the ANS document refers to this as a "counted string." This length limitation has generally not been a problem.

Within Forth code, you'll see strings referenced two ways: either the address of a counted string ( adr ), or the address and length of the string text ( adr n ). The word COUNT converts the former to the latter.

As I recall, ANS Forth expects characters to be individually addressable, so a character cannot be smaller than an "address unit" (the smallest addressable unit of memory). On a machine that only addresses 32-bit words, that's a bit wasteful.

I did a non-ANS Forth for the Freescale DSP56800 family, for which the address unit (and cell size) is 16 bits. For word names and text strings I packed two bytes to the cell, with null padding if required. I then modified the string comparison in FIND, and the text output from ." , to use the packed format.

FIG-Forth did some tricks with the length byte of a word name. The high bit was set as a marker, as was the high bit of the last character of the name, so that TRAVERSE could find either end of the name string. This also allowed the historical "length plus first three characters" form of name storage. The next-highest bit (0x40), if I recall correctly, was the IMMEDIATE flag, and 0x20 was the SMUDGE bit that made a word temporarily unfindable in the dictionary. (I may have those last two reversed.) This limited word names to 31 characters.

For CamelForth I wanted word names to be in standard string format, so I (a) rearranged the header so that it's never necessary to traverse names backwards; (b) required names to be stored in full in memory; (c) used a different method (unlinking and relinking) to make words temporarily unfindable; and (d) dedicated an entire byte to the IMMEDIATE flag.

_________________
Because there are never enough Forth implementations: http://www.camelforth.com


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 2:40 am 
Offline

Joined: Tue Jan 07, 2014 8:40 am
Posts: 91
barrym95838 wrote:
If I follow the path I have begun, there will be no difference between my @ and C@ , ! and C! , + and D+ , , and C, , etc ... Essentially everything will be 32-bits wide, with the possible exception of (optionally) packed strings. Do any of you find this to be particularly annoying, and would it be a significant portability issue for existing code? Should I leave the possibility open for a 64-bit D+ and relatives?


As you have described your machine, making C@ and @ identical, and C! and ! identical, would be ANS compliant. In your system, the character size is the same as the cell size. (ANS refers to the machine's word size as "cell" so as to avoid confusion with the Forth "word" which is an entry in the dictionary.)

However, making + and D+ identical is not ANS compliant. On your machine, D+ should be a 64-bit operation. (This is also consistent with all previous Forth standards. D+ is always a double-cell operation.)

_________________
Because there are never enough Forth implementations: http://www.camelforth.com


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 2:52 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Thanks, Dr. Brad.

I have been checking out some of your web-published work, and I think that I've seen your header's 'previous link' pointer pointing to the previous 'name length' byte, correct? I'm asking this because I want to keep the number of labels to a minimum in my source, and it seems to me that I should be able to get away with just a label for the CFA. The code field would immediately follow, and the name, previous link and other attributes could go immediately before. There would be a slight bit of wasted time locating the beginning of the name (unless I stored it backwards!), but I'm trying to examine the trade-offs. Waddya think?

Mike


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 3:04 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8539
Location: Southern California
Answering lots of things above, since so many posts were added while I was writing:

It might be best to use the first 8 bits for name length (in bytes, up to 31), smudge bit, and immediate bit, then put one character in each 8 bits after that. If the length is more than 3, it will grab the next 32-bit group, or more that 7, it will grab yet another, etc.. For the last 32-bit group often not being full, it may be advantageous to go left-justified, making it more compatible with other strings and string operations. You may for example want to do some sort of string expression and use its net result as an input for FIND.

The link field would be in the next address, not partly occupying bits that are otherwise unused in the name field. And yes, I definitely like to use the full 8 bits for name characters, to get lots of special characters including Greek ones used in engineering.

Standard format for strings in Forth is that the first byte tells the length, rather than using null termination. I could almost go either way, but I think the counted string may have an edge in the list of advantages. It does streamline the header-making process if normal strings have the same form as the name field (except that you'll have to treat the three high bits).

I have never found any problems with strings being limited to 255 characters, but that may not mean much since my use of Forth is not oriented to human I/O. It would be easy for you to make them able to go to 65535. One of my books was formatted in Forth and they said they made a string able to be long enough to have each page be a string. You'll want to think about how to handle the situation where for example you want to shorten the string by 1 or 2 from the left end. Doing a long series of rotations would not be efficient. Maybe two bits following the length field could tell how many to skip before reading the first character; or, you could just use leading nulls that don't get counted in the length. The Forth way of doing strings allows any character in strings, even nulls. The only ones I can think of that you can't really use in actual names are null, backspace, carriage return, line feed, and space, for various reasons.

Do keep the doubles. They are definitely needed. My potential use for a 32-bit Forth definitely requires that some intermediate results be more than 32 bits.

A C@ could be like for doing LDA VIA1PB, AND#$000000FF, <push on stack> where the VIA is only on 8 bits of the data bus and the other bits' states are meaningless and will just be whatever value was last driven on the bus, held there by bus capacitance.

My own HEADER macro in my '816 ITC Forth, for C32 assembler has:
Code:
HEADER:         MACRO NAME, precedence          ; Lay down name and link fields.

 IF HEADERS? && ! OMIT_HEADERS  ; If HEADERS is true and OMIT_HEADERS is false,
                                ; then go ahead lay down the header.
    last_NFA:   SETL    new_NFA
     new_NFA:   SETL    $
                DFB     precedence | {npc-$-1}, NAME
         npc:                           ; Use this to calc name length above.
        IF      $ & 1                   ; If next addr is odd,
                DFB     0               ; add a 0 byte before you
        ENDI
                DWL     last_NFA        ; lay down the link field.
 ELSE
        IF      $ & 1
                DFB     0               ; Even if headers are not allowed,
        ENDI                            ; you should still align.
 ENDI
                ENDM
 ;-------------------

SETL is "SET Label," assigning a value to as assembler variable. DFB is "DeFine Byte," and DWL is "Define Word, Least significant byte first." (The assembler variables HEADERS? and OMIT_HEADERS allow locally or globally headerless assembly. The link field points to the length first byte of the previous word's name field.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 3:20 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Thanks for posting your header, Garth! I don't quite understand it 100% yet, but I'll study it until I get it. I see that re-using labels with 'SET' can be economical, and I'll probably go that route to point my 'previous link' to the beginning of the previous name field. I'm also going to quit trying to squish my code with all of those 'bra's to a two-instruction NEXT, and just switch to in-lining it ... those hand-calculated relative branch offsets were starting to get on my nerves anyway!

Mike


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 8:56 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10977
Location: England
Nice idea to allow the strings to start non-justified - but to keep the benefit of allowing any character in a counted-string implementation, there's no room for special values. So as Garth says, you need to find 2 bits for this.


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 10:36 am 
Offline

Joined: Tue Jan 07, 2014 8:40 am
Posts: 91
barrym95838 wrote:
I have been checking out some of your web-published work, and I think that I've seen your header's 'previous link' pointer pointing to the previous 'name length' byte, correct? I'm asking this because I want to keep the number of labels to a minimum in my source, and it seems to me that I should be able to get away with just a label for the CFA.


Correct on both counts. The common CamelForth header format is link, immediate flag, name length, name text (i.e. name as a counted string), followed by the body which in ITC is code address, rest of body. Note that the code address field should be considered part of the body, not part of the header. (It's used during execution, and it must be retained even with headerless code.) I do have the link point to the previous word's name (its "name field address"). Starting from that point, it is easy to traverse forwards to the code field (by adding the length+1), and backwards (by fixed offsets) to the link and immediate fields.

From the MSP430 example:
Code:
HEADER  MACRO   asmname,length,litname,action
        PUBLIC  asmname
        DW      link
        DB      0FFh       ; not immediate
link    SET     $
        DB      length
        DB      litname
        EVEN
        IF      'action'='DOCODE'
asmname: DW     $+2
        ELSE
asmname: DW      action
        ENDIF
        ENDM


Only one label is required per word, the "asmname" (the name which is used while writing assembler code). For example, the word whose "litname" (Forth name) is @ has an asmname "FETCH", because special characters aren't permitted in assembler labels. As in:
Code:
        HEADER  FETCH,1,'@',DOCODE
        MOV     @TOS,TOS
        NEXT


The only other address you need to know is the "link" value for the final word, which become the initial value of the LATEST pointer. That's done at the very end of the MSP430 source with
Code:
lastword equ link        ; last word in dictionary

_________________
Because there are never enough Forth implementations: http://www.camelforth.com


Top
 Profile  
Reply with quote  
PostPosted: Wed Apr 02, 2014 7:49 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8539
Location: Southern California
I should have included a few examples from my '816 Forth in my post above. As Brad said, the CFA is not part of the header. Whether you use headers or not, every ITC Forth word requires the CFA. What does not require it is runtimes that Forth words' CFAs might point to, as in the case of nest, shown below. Note however that unnest is a Forth word.

There are a handfull of things the CFA could point to. In the case of primitives, it's just the address immediately following the CFA, because that's where the code to run is. In the case of secondaries, the CFA points to nest, a runtime. Other runtimes include (but are not limited to) const (used by constants, and, in my system, even variables because their address is constant), 2const (for double-precision constants), or any other machine-language routine (ending in JMP NEXT, not RTS), as in the case of the word -1 below which just points to PUSH_TRUE, and 0 (zero) which just points to PUSH_FALSE, both of which many primitives jump to to finish up. (I put the PUSH_TRUE and PUSH_FALSE code there too.)

I used a GO_NEXT macro to easily change what gets assembled there just by changing a single equate in the assembly code, although in practice I've left it at JMP(NEXTadr) for my high-level Forth interrupt service which slows things down about 3% (because of the indirection) but the speed of NEXT itself is unaffected, except that if there's an interrupt to service in high-level Forth, NEXTadr contains the address of a different version of NEXT which goes into the ISR, and it's actually faster than the normal NEXT-- almost like a negative overhead for handling interrupts.

At the end below you'll see DUP>R, a primitive to replace the common occurrence of DUP >R but headers are turned off for that one, yet it still starts with PRIMITIVE which lays down the code field that points to the the code immediately following at $+2. You might ask, "Why not just comment-out the HEADER line if you don't want the header?" but HEADER and NO_HEADERS make it quicker to remove or put back the headers through a range of Forth words all at once, without having to comment-out or "de-comment" every one individually.


Code:
Selected examples:

PRIMITIVE: MACRO                        ; The main alternative to PRIMITIVE for
           DWL    $ + CELL_SIZE         ; a code field is DWL nest (or DOCOL ).
           ENDM                         ; CFA is necessary even without headers.
 ;-------------------

CONSTANT:  MACRO   number               ; This compiles a constant.  You still
           DWL     const, number        ; need to use the HEADER macro first if
           ENDM                         ; you want a header.  Other assemblers
                                        ; might allow you to modify this so the
                                        ; assembly-language CFA label name can
                                        ; be included in the same macro call
 ;-------------------                   ; that sets up the header.

VARIABLE:       MACRO                   ; SF168,174,260,262  FIG  F83  ANS_CORE
        DWL     const, THERE            ; THERE refers to the last value left,
 THERE: SETL    THERE + 2               ; not this new updated value.
        ENDM
 ;-------------------

        HEADER "-1", NOT_IMMEDIATE      ; -1 is the Forth word's name.  MINUS1
MINUS1: DWL     PUSH_TRUE               ; is the CFA label for the assembler.


        HEADER "0", NOT_IMMEDIATE       ; Here with -1 , 0 , TRUE , FALSE , and
ZERO:   DWL     PUSH_FALSE              ; DUMMYCELL , making the code field
                                        ; point to PUSH_FALSE and PUSH_TRUE is

        HEADER "1", NOT_IMMEDIATE       ; faster and shorter than an actual
ONE:    CONSTANT 1                      ; Forth constant.


        HEADER "2", NOT_IMMEDIATE       ; These numbers are given as Forth words
TWO:    CONSTANT 2                      ; since these numbers get so much use.
                                        ; Every use of a constant in place of a
                                        ; literal saves a cell (two bytes).

 HEADER "FENCE", NOT_IMMEDIATE  ; Address behind which FORGET can't go. Normally
FENCE:  VARIABLE                ; holds NFA of the first FORGETable word.
                                ; Init'd by COLD.

BASEdata:       EQU     THERE   ; (BASEdata is for primitives' use.)
 HEADER "BASE", NOT_IMMEDIATE   ; number base for number I/O.  Init'd by COLD.
BASE:   VARIABLE                ; (Use BASE in secondaries where you want CFA.)


STATEdata:      EQU     THERE   ; For primitive STATE_@, which saves time & mem.
 HEADER "STATE", NOT_IMMEDIATE  ; -1=compile, 0=interpret.
STATE:  VARIABLE                ; [ in QUIT turns it OFF.



PUSH_FALSE:  DEX_DEX
SET_FALSE:   STZ     0,X
             GO_NEXT
 ;-------------------

nest:   PEI     IP      ; PEI IP replaces LDA IP , PHA here.  nest is the
        LDA     W       ; runtime code of : (often called DOCOL ). It is not
        INA_INA         ; really a Forth word itself per se-- but it is pointed
        STA     IP      ; to by the CFA of secondaries.
        GO_NEXT
 ;-------------------

        HEADER "unnest", NOT_IMMEDIATE   ; ( -- )
unnest: PRIMITIVE                        ; This does the opposite of
        PLA                              ; nest, and the same as EXIT.
        STA     IP                       ; It is often called SEMIS
        GO_NEXT                          ; because it's compiled by ;
 ;-------------------

        HEADER "/", NOT_IMMEDIATE        ; ( n1 n2 -- quot )
SLASH:  DWL     nest, sMOD, NIP, unnest
 ;-------------------

        HEADER "1+", NOT_IMMEDIATE       ; ( n -- n+1 )
_1ADD:  PRIMITIVE
        INC     0,X
        GO_NEXT
 ;-------------------

 NO_HEADERS
        HEADER "DUP>R", NOT_IMMEDIATE   ; ( n -- n )
DUP2R:  PRIMITIVE
        LDA     0,X
        PHA
        GO_NEXT
 HEADERS
 ;-------------------

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 07, 2014 5:54 am 
Offline

Joined: Mon Jan 07, 2013 2:42 pm
Posts: 576
Location: Just outside Berlin, Germany
If I understand this correctly, it seems that counted strings are an older format in Forth that is slowly (and with Forth, that word really means what it says) being phased out. The draft for ANS Forth 200x states:
Quote:
Forth 94 moved toward the consistent use of the “c-addr u” representation of strings on the stack. The use of the alternate “address of counted string” stack representation is discouraged.

Which is why PARSE and PARSE-NAME both return "c-addr u" (that's "address of character string" and "unsigned single-cell number") instead of counted strings like the older WORD.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 28 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: