Hashes instead of names in dictionary

enso1 · Post by **enso1** » Fri Mar 28, 2025 7:15 pm

Some time ago I built a proof-of-concept x86 Forth which does not store word names, but instead, 32-bit FNV1a hashes. This makes the dictionary more regular, searches faster, and is overall satisfying.

The con: you lose the actual ASCII names, unless you keep them elsewhere. However, this is surprisingly rarely needed, and I keep a symbol file for debugging.

For 6502, a 32-bit FNV1a is a bit of an overkill, and I am looking at simpler hashes. DJB2 works surprisingly well with just shifts and adds. 32-bits is hardly necessary, in fact 16 bits do not collide on the entire Tali dictionary.

I've seen discussions here about a hashed dictionary structures, but has anyone else experimented with keeping name-hashes and discarding names?

BruceRMcF · Post by **BruceRMcF** » Fri Mar 28, 2025 7:55 pm

enso1 wrote:

Some time ago I built a proof-of-concept x86 Forth which does not store word names, but instead, 32-bit FNV1a hashes. This makes the dictionary more regular, searches faster, and is overall satisfying.

The con: you lose the actual ASCII names, unless you keep them elsewhere. However, this is surprisingly rarely needed, and I keep a symbol file for debugging.

For 6502, a 32-bit FNV1a is a bit of an overkill, and I am looking at simpler hashes. DJB2 works surprisingly well with just shifts and adds. 32-bits is hardly necessary, in fact 16 bits do not collide on the entire Tali dictionary.

I've seen discussions here about a hashed dictionary structures, but has anyone else experimented with keeping name-hashes and discarding names?

To make searches faster, you could prefix the name with an 8bit DJB2 hash, since the low byte of a DJB2 is going to be dominated by the trailing letters in the name.

Actually, if 24 bits is going to be enough to avoid collisions, you could prefix the full 24bit DJB2 (little endian) with an 8bit DJB2 on the length and first three characters, which was (after all), the full name part of the header for some early Forths, so it will likely have really good rejection on the first byte for a lot of Forth primitives.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Mar 28, 2025 7:56 pm

I believe this topic has come up in the past.

If I were doing it and was wanting to keep the ASCII word names intact, I’d consider use of an ordered list with pointers being processed with a binary search.

enso1 · Post by **enso1** » Fri Mar 28, 2025 11:40 pm

Well, the topic I found was about using a hashtable or a tree for the dictionary.

What I propose (and implemented) is keeping the dictionary but instead of names keep hashes. Discard the names. This fixes the size of the headers, so we can just keep them in an array (if we keep them separately), and searches are fast and simple to implement, and the headers take very little space.

Just hash the input token, and search for the hash.

Like I said earlier, there were no collisions with a 16-bit DJB2 with all of TaliForth words...

I like linear dictionaries.

resman · Post by **resman** » Sat Mar 29, 2025 6:03 pm

Are you going to pre allocate the dictionary? How big will it be?

I do believe having a hash table is quite useful for improving lookup speeds.

I was looking at LISP implementations for the 6502 when I was writing my LISP for PLASMA when I came across a modern-ish one that tried to use a hash to improve function lookups. They wrote it in C (cc65) and took a traditional approach to a hash function (shifting and adding all the characters in the string). They clearly weren't used to the performance hit between a modern processor and a 6502 executing compiled C code. They couldn't understand why using a hash function *slowed down* their lookup. You could probably do a linear search of half the function names in the time it took to calculate the hash!

When implementing PLFORTH, I used a pretty traditional dictionary format because I did want to keep the names around for SEE. But a linear search through the dictionary was going to be a bit painful in PLASMA. To improve the lookup, a small 64 entry hash table and a super simple hash function improved the lookup performance immensely. Even with a bunch of user words, the average was about two to three collisions per hash. The distribution wasn't even perfect, just okay.

Code: Select all

def hashname(chars, len)#1
  return (len ^ ((^chars << 1) ^ ^(chars + len / 2) << 2)) & HASH_MASK
end

So the length, first character and middle character of the name allows for a hash calculated in constant time followed by an efficient quick reject string search gets a match (or not). Actually used the original search routine, just traversing the next-hash-link field instead of next-dictionary-entry field.

This only took an additional 64 entry hash table, a hash-link per dictionary entry and just a few minutes to implement. Tuning the hash function took a bit more time, but the implementation was trivial and easily added to a traditional dictionary.

Anyway, not saying your approach isn't good, just providing another data point using a more traditional dictionary approach.

enso1 · Post by **enso1** » Sat Mar 29, 2025 9:56 pm

True, a hashtable is hugely better than a linear search -- although I kind of like how you can "shadow" thing in a dictionary... In Forth, the dictionary search time is not much of an issue, generally.

DJB2 is n*33+c on every character, or n+(n<<5)+c, so it is pretty fast. It's kind of interesting that for every new 8 bits it shifts out 5 old bits! I have to look at the function above a little more closely...

BruceRMcF · Post by **BruceRMcF** » Sun Mar 30, 2025 4:24 pm

enso1 wrote:

True, a hashtable is hugely better than a linear search -- although I kind of like how you can "shadow" thing in a dictionary... In Forth, the dictionary search time is not much of an issue, generally.

DJB2 is n*33+c on every character, or n+(n<<5)+c, so it is pretty fast. It's kind of interesting that for every new 8 bits it shifts out 5 old bits! I have to look at the function above a little more closely...

I can see why the middle character is better for the hash than the 2nd one or the last one ... because a number of words in the dictionary will end in ] ! @ + and similar abbreviations, and some words will have a common start, like UM* and UM/MOD and UM/.

But what is especially clever is that you get the index with a LSR of the length byte, so it's less work. And "chars + len/2" means that for two letter words, "chars 2/2 +" is the second letter, so there is a unique hash among all one and two letter words defined, and for a for an 8bit or smaller hash, the whole hash generation only requires four shifts ... dividing LEN by 2, one 8bit shift of the first character, and two 8bit shifts of the middle character.

I'd assume that the distribution is decent for a 6bit hash, but for a 3bit hash, I'd take the 2nd-4th bits rather than the bottom three to get contributions in two bits from the middle character.

And if space is a priority over speed, it's straightforward in Forth. Not tested, but I think it's something along the lines of:

Code: Select all

HEX
: MYHASH64 ( a u1 -- u2 ) \ generate a 6bit hash
\ Uses Length, first char shifted once, and middle char shifted twice
\ This version gives nul strings a hash equal to their address
\ It could also abort on a nul string
    ?DUP IF
        2DUP 2/ + C@ 2* 2* >R \ (2<<middle_char)
        SWAP CHAR+ C@ 2* XOR \ (Length)XOR(1<<1st_char)
        R> XOR  \ Cell wide HASH, at most 10bits.
    THEN 3F AND ;
DECIMAL

resman · Post by **resman** » Mon Mar 31, 2025 8:55 pm

BruceRMcF wrote:

I can see why the middle character is better for the hash than the 2nd one or the last one ... because a number of words in the dictionary will end in ] ! @ + and similar abbreviations, and some words will have a common start, like UM* and UM/MOD and UM/.

But what is especially clever is that you get the index with a LSR of the length byte, so it's less work. And "chars + len/2" means that for two letter words, "chars 2/2 +" is the second letter, so there is a unique hash among all one and two letter words defined, and for a for an 8bit or smaller hash, the whole hash generation only requires four shifts ... dividing LEN by 2, one 8bit shift of the first character, and two 8bit shifts of the middle character.

I'd assume that the distribution is decent for a 6bit hash, but for a 3bit hash, I'd take the 2nd-4th bits rather than the bottom three to get contributions in two bits from the middle character.

And if space is a priority over speed, it's straightforward in Forth. Not tested, but I think it's something along the lines of:

Code: Select all

HEX
: MYHASH64 ( a u1 -- u2 ) \ generate a 6bit hash
\ Uses Length, first char shifted once, and middle char shifted twice
\ This version gives nul strings a hash equal to their address
\ It could also abort on a nul string
    ?DUP IF
        2DUP 2/ + C@ 2* 2* >R \ (2<<middle_char)
        SWAP CHAR+ C@ 2* XOR \ (Length)XOR(1<<1st_char)
        R> XOR  \ Cell wide HASH, at most 10bits.
    THEN 3F AND ;
DECIMAL

I actually recoded it in assembly for a little speed boost when reading & compiling source files. Very simple and like I mentioned, the distribution is okay. I originally used the first and last characters of the name. The distribution was very lumpy - as you surmised, we like to pick names that have either a common beginning or ending. I didn't try on larger or smaller hash tables, as the 64 entry fit the bill for my memory vs. speed tradeoff.

Note: PLASMA uses a split stack:

Code: Select all

asm hashname(chars, len)#1
        LDA     ESTKL+1,X
        STA     SRCL
        LDA     ESTKH+1,X
        STA     SRCH
        LDA     ESTKL,X
        LSR
        TAY
        LDA     (SRC), Y
        ASL
        ASL
        EOR     ESTKL,X
        STA     ESTKL,X
        LDY     #$00
        LDA     (SRC),Y
        ASL
        EOR     ESTKL,X
        AND     #63           ; HASH_MASK
        INX
        STA     ESTKL,X
        STY     ESTKH,X
        RTS
end

BruceRMcF · Post by **BruceRMcF** » Tue Apr 01, 2025 5:13 am

resman wrote:

... I actually recoded it in assembly for a little speed boost when reading & compiling source files. Very simple and like I mentioned, the distribution is okay. I originally used the first and last characters of the name. The distribution was very lumpy - as you surmised, we like to pick names that have either a common beginning or ending. I didn't try on larger or smaller hash tables, as the 64 entry fit the bill for my memory vs. speed tradeoff. ...

Yes, but I am going to aim be working with an ANS Forth style wordlist stack, so a 64 entry hash table is 128 bytes in the dataspace per wordlist definition.

But thinking about it, if I avoid a "SMUDGE" based CREATE and have a full separate byte of available word typing bits rather than just IMMEDIATE and COMPILE-ONLY, so there are three spare bits at the top of the LENGTH byte, I could split a 6bit hash into a 3 bit part for an 8 entry hash table and a 3bit part to OR into the high bits of the LENGTH bits to get faster rejection while searching the wordlist. An 8-entry hash table per wordlist is just 16 bytes of dataspace per wordlist definition.

But that is the hypothetical 65816 Forth ... the one I am working on now is is going to aim for a different trick, which is a kind of module system for compiling low level support words or data structures for a task into the 8KB window in the CX16 which gives access to the 512KB-2MB High RAM space. The high level word/words in the dictionary in main RAM then "opens" a module to bring the RAM bank with the support code into the High RAM bank window, stacking the current bank number in the Rack, and before exiting will (indeed, must) close the module to restore the RAM bank to what it was before.

The idea is that with the open and close module words as immediate compile-only words, then the open module can bring the module words into the search order and bring the correct High RAM bank into view at compile-time at the same time as it compiles the operation to bring the correct High RAM bank into view at execution time, and the close module can remove the module from the search order and restore the High RAM bank while it compiles the operation to restore the High RAM bank during execution time. The search order manipulation only happens at compile time, so speed in that process is not of the essence, and the setting and restoring of the High RAM banks at execution time is very quick.

Inspired by User variables, rather than store the bank number itself, the module definer stores the offset from the base of module binaries, so the base and the number of banks acquired for module code is enough information to store it all in a binary file, and it can then be reloaded into anywhere in High RAM that has enough space, so a "save image" process with two binary image files can work, and after the main image is loaded, if it finds the module image variable set, it can ask for the HighRAM bank to load the module image into.

It's not an "all bells and whistles" module system, but I think it can be done fairly compactly so that it saves far more dictionary space in the main RAM than it costs.

And by avoiding having a lot of little specialized support words in the main dictionary, it speeds up the search by the old fashioned technique of "having fewer things to search through".

Hashes instead of names in dictionary

Hashes instead of names in dictionary

Re: Hashes instead of names in dictionary

Re: Hashes instead of names in dictionary

Re: Hashes instead of names in dictionary

Re: Hashes instead of names in dictionary

Re: Hashes instead of names in dictionary

Re: Hashes instead of names in dictionary

Re: Hashes instead of names in dictionary

Re: Hashes instead of names in dictionary