dictionary hashing
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Hmmm... I think that would foul things up for FORGET or MARKER too. Still, let me know if I'm misunderstanding your idea.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
GARTHWILSON wrote:
Hmmm... I think that would foul things up for FORGET or MARKER too. Still, let me know if I'm misunderstanding your idea.
Anyway, for any given information service, ADD, EDIT, and DELETE all need to know the internal structure of the database they're adding to, editing, or removing from, and all need to know how to search for their data. Otherwise, they couldn't do their task. Hence, if you make changes to how things are added or edited, you can bet that you're going to need to adjust the deletion procedures too.
In other words, what I'm trying to say is, "expect it, and prepare for it."
I'm actually going through this right now with my own FTS/Forth (a native-booting Forth for x86 PCs!!), which is based on RetroForth 8 running under the L4 microkernel. I've had to rewrite pretty close to 66% of the code, because of all the dependencies it has on the vocabulary and on how it handled user input. (When you think about it, there are only three major components to any Forth system: keyboard input, which I'm going to rewrite shortly; video output, already rewritten; and the dictionary database, for which I've rewritten about half the code.) So yeah, I'm easily approaching a 66% rewrite of the code.
I'm going to end up rewriting everything again soon enough anyway, but right now I'm concentrating on getting as much of the original code running as possible. Believe it or not, I actually have about four people waiting for its completion, including (ironically) the author of RF8.
GARTHWILSON wrote:
(This would of course mean the headers all have to be in RAM, not ROM.)
Code:
ORG TOP_OF_RAM-NUMBER_OF_HEADS*4
LINK4: DW LINK3,D_HEAD
LINK3: DW LINK2,C_HEAD
LINK2: DW LINK1,B_HEAD
LINK1: DW END_OF_LINKS,A_HEAD
Code:
NIP_HEAD: DB 3,"NIP" ; name length, name
NIP: DW DO_COLON,SWAP,DROP,EXIT
Anyway, having the entire dictionary cached does have some interesting properties.
1. You don't have to decide how big to make the cache, or whether you would want to allow it to be resized.
2. FIND will be a smaller, simpler (but not necessarily faster) routine, since it only has to search the cache, not the cache then the dictionary.
3. FIND doesn't have to compare the same word twice. For example, if A is a recently defined word and is in the cache, and B is not recent and not in the cache, then when using a separate cache and searching for B, FIND will usually compare B to A twice, once when searching the cache and once when searching the dictionary.
4. Words that are used occasionally (i.e. they wouldn't usually be found in a separate cache), would not necessarily fall all the way to the end (i.e. last word found) of the dictionary, so you might be able to do without hashing. I should point out that the use of a cache does not depend on any particular dictionary implementation (hashing, linked lists, etc.).
So there are some trade-offs here, but it's an interesting approach.
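The whole-dictionary cache described above behaves like a move-to-front list: there is one search pass, nothing is compared twice, and words that get used at all drift away from the end. A toy Python model (hypothetical; not from any of the Forths discussed here) makes the idea concrete:

```python
# A toy model of a dictionary kept in most-recently-found order, so FIND
# is a single pass and recently used words sit near the front.
# (Illustrative sketch only; a real Forth would do this by relinking
# word headers, not by shuffling a Python list.)

class MRUDictionary:
    def __init__(self, words):
        self.words = list(words)        # index 0 is searched first

    def find(self, name):
        """Linear search; on a hit, promote the found word to the front."""
        for i, w in enumerate(self.words):
            if w == name:
                self.words.insert(0, self.words.pop(i))
                return w
        return None

d = MRUDictionary(["DUP", "SWAP", "DROP", "NIP"])
d.find("NIP")               # found last this time...
print(d.words[0])           # NIP -- ...but searched first from now on
```

Note that the promotion on every hit is exactly why point 4 holds: an occasionally used word is pulled forward each time it is found, so it never sinks all the way to the end between uses.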
Re: Perfect hash
TMorita wrote:
It's called a "perfect hash".
Last edited by asmlang_6 on Sun Aug 21, 2005 4:52 pm, edited 1 time in total.
Sam
---
"OK, let's see, A0 on the 6502 goes to the ROM. Now where was that reset vector?"
Re: Perfect hash
asmlang_6 wrote:
TMorita wrote:
It's called a "perfect hash".
The 6502 and 65816 can store a whole lot more than just 256 data items. The 6502 has a 64KB address space, and the 65816 can address 16MB.
*NO* processor, regardless of how wide its data paths, can arrange for a perfect hash with an arbitrary set of keys. Even MD5 has recently had collisions.
The name "Perfect Hash" stems from the fact that, given n keys, you can guarantee that each key will hash into buckets 0..n-1, without any collisions. But there are many restrictions: see http://burtleburtle.net/bob/hash/perfect.html
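The idea can be sketched with a brute-force search: given a fixed key set, try hash parameters until every key lands in a distinct slot. (Hypothetical illustration only: the keyword list and the cheap hash family below are invented here, and the table is allowed to be larger than n, so this is a perfect but not necessarily minimal hash. Real generators, like the one at the URL above, are far more sophisticated.)

```python
# Brute-force a collision-free ("perfect") hash for a small fixed
# keyword set. Hash family and keywords are illustrative assumptions.

KEYWORDS = ["DUP", "SWAP", "DROP", "OVER", "NIP"]

def hash_of(word, m, size):
    # cheap hash: multiplier * (first char + length) + last char
    return (m * (ord(word[0]) + len(word)) + ord(word[-1])) % size

def build_perfect_hash(words, max_size=64, max_mult=64):
    """Search for (size, m) such that every word gets its own slot."""
    for size in range(len(words), max_size):
        for m in range(1, max_mult):
            slots = {hash_of(w, m, size) for w in words}
            if len(slots) == len(words):      # no collisions at all
                return size, m
    return None

size, m = build_perfect_hash(KEYWORDS)
table = [None] * size
for w in KEYWORDS:
    table[hash_of(w, m, size)] = w
# Lookup now needs exactly one string comparison:
assert table[hash_of("SWAP", m, size)] == "SWAP"
```

This also shows why the technique only works for a statically known key set: the exhaustive search is done once, offline, and adding a new key generally invalidates the parameters.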
Re: Perfect hash
asmlang_6 wrote:
TMorita wrote:
It's called a "perfect hash".
Perfect hashes are used for things like decoding statically-defined keywords, so all the keywords will hash to a different value and you only need to perform one string check.
It's not to be used for general-purpose string matching.
Toshi
I don't know if anyone's following chitselb's adventures in comp.lang.forth
but I thought I'd point out this thread:
"What are my options for dictionary hashing?"
There's some interesting stuff there; unfortunately there's also a lot of
rubbish to wade through.
Looking at the statistics there, I think I'd try finding some very simple
(i.e. fast) hash that put the most frequently used words in separate buckets
(while retaining a reasonable distribution for everything else), then
promote the most recently used word to the second spot in the list (leaving
the predetermined most-used word in first place).
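That promotion rule could be modeled like this (a hypothetical Python sketch; the bucket contents and the choice of which word is "most frequent" are invented for illustration):

```python
# Each bucket's slot 0 holds a predetermined most-frequently-used word
# that is never displaced; a word found anywhere deeper is promoted
# only to slot 1 (the "second spot").

def find(bucket, name):
    for i, w in enumerate(bucket):
        if w == name:
            if i > 1:                       # leave the fixed slot 0 alone
                bucket.insert(1, bucket.pop(i))
            return w
    return None

bucket = ["DUP", "SWAP", "OVER", "ROT", "NIP"]   # slot 0 = most frequent
find(bucket, "NIP")
print(bucket)    # ['DUP', 'NIP', 'SWAP', 'OVER', 'ROT']
```

The appeal is that a burst of uses of one rare word can't evict the statistically best word from the head of the chain, while still giving recently used words a near-front position.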
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
bogax, thanks for the alert. I've been reading for a long time and I'm still only halfway through it. As for the "rubbish," yes, there are a couple of particularly abrasive people on there dukin' it out. It's rather distasteful, but I can't say I haven't learned anything from it. There's a ton of material there (taking several pages), and Elizabeth Rather herself even posted there. It seems to me that we made more-efficient progress here.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Wow, I didn't know this thread was here. Guess I have some catching up to do. Thanks for waking it up after 5 years, bogax! I admit it, I was about as thrilled to see Elizabeth Rather posting as I would have been to have Grace Hopper check out my COBOL program or to have Phil Zimmermann sign my PGP key. Oh wait, he did that.
I'd like to get your feedback on what I'm doing, to wit:
* make a lot of internals words headerless since users won't need them
* standard linked list searching/chaining in child vocabularies...
* until getting to the top of the root (FORTH) vocabulary
* 4-bit Pearson hash indexing into...
* sixteen uniformly-distributed threads...
* sorted in ascending order by name length...
* with a Bloom filter to prevent us from trying to find most numbers
It could always be "more optimal," but it's already going to blow the doors off the single linked list (which is also the worst case for all numeric literals) that fig/Blazin'/et al. use.
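For the curious, here's a rough Python model of the scheme in that list (not PETTIL's actual code: the Pearson permutation table, the Bloom filter size, and its probe functions are all arbitrary illustrative choices):

```python
# Sketch: a 4-bit Pearson hash selects one of sixteen threads, and a
# small Bloom filter rejects most strings (e.g. numeric literals) that
# are not in the dictionary at all, before any thread is searched.
import random

random.seed(1)
T = list(range(16)); random.shuffle(T)        # 4-bit Pearson table (a permutation)

def pearson4(name):
    h = 0
    for ch in name.upper():
        h = T[(h ^ ord(ch)) & 0x0F]
    return h                                  # 0..15: which thread to search

threads = [[] for _ in range(16)]
bloom = 0                                     # 32-bit Bloom filter

def probes(name):
    # Two illustrative probe positions per name (Python's built-in hash
    # stands in for whatever cheap hashes a real system would use).
    return (hash(name) % 32, hash(name[::-1]) % 32)

def add_word(name):
    global bloom
    threads[pearson4(name)].insert(0, name)   # newest first in its thread
    for k in probes(name):
        bloom |= 1 << k

def find(name):
    # Cheap Bloom check first: if any probe bit is clear, the word is
    # definitely absent, so we skip the thread search entirely.
    for k in probes(name):
        if not bloom & (1 << k):
            return None
    return name if name in threads[pearson4(name)] else None

for w in ["DUP", "SWAP", "DROP", "1234"]:
    add_word(w)
```

The Bloom filter can give false positives (a string passes the bit test but isn't present), which is why the thread search still runs as the final check; it never gives false negatives, so defined words are always found.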
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: dictionary hashing
Another bump ...
I have noticed that the different Forths out and about have quite different dictionary orders. Some seem to group primitives together, then secondaries. Others seem to group by function, and others seem to group by popularity. Do any of the more-experienced users here have a preference for a simple linear-search (FIND) implementation? I realize that certain words like FORGET can have much-different global effects based on the dictionary orders, right? Or is the 'base' vocabulary usually FENCEd-off out of harm's way?
I'm sorry if these questions have painfully-obvious answers. I'm a Forth n00b, but I couldn't decide whether to post here or in 'Newbies'.
Mike
Re: dictionary hashing
I also like a simple linear-list implementation. But then I work in embedded systems; execution speed is important, but compilation speed is not.
I believe some people have gone to the trouble of analyzing which kernel words are most often used in compilation, and placing them later in the dictionary, so they're nearer the head of the search chain and found quickly. (I vaguely recall that New Micros' 68HC11 Max-Forth did this.)
IIRC, the only requirement ANS Forth places on dictionary order is that if you redefine a word, the newest definition must appear when you search the dictionary for that word.
As a rule, one should not FORGET words which are in the kernel.
Many Forths deliberately block this action. The ANS Forth MARKER word avoids this problem.
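MARKER's behavior can be modeled in a few lines of Python. (A loose sketch only: a real MARKER defines a named word and removes that word itself along with everything after it; this model just snapshots and truncates a list standing in for the dictionary.)

```python
# Toy model of MARKER semantics: record the current dictionary extent,
# and executing the marker forgets everything defined after it, so the
# kernel words can never be lost the way a careless FORGET could lose them.

dictionary = ["DUP", "SWAP", "DROP"]          # stand-ins for kernel words

def marker():
    saved = len(dictionary)                   # snapshot the current extent
    def restore():
        del dictionary[saved:]                # forget everything newer
    return restore

checkpoint = marker()
dictionary += ["FOO", "BAR"]                  # later user definitions
checkpoint()
print(dictionary)      # ['DUP', 'SWAP', 'DROP'] -- kernel untouched
```

The key safety property is visible here: the marker can only roll back to a point that existed when it was created, so words defined before it (the kernel) are structurally out of reach.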
Because there are never enough Forth implementations: http://www.camelforth.com
Re: dictionary hashing
Has anyone tried tinkering with the header structure?
My Forth for the C64 has the following header structure: link field, name field, code field, parameter field.
The link field of one word would point to the name field of the previous word in the dictionary. When I changed the system so the link field of a word pointed to the link field of the previous word, I was able to rewrite (FIND), the find primitive, so that it took about two-thirds the time, and it was also smaller. (FIND) also now took the address of a vocabulary instead of the address of the last word's name field in that vocabulary.
I also had to rewrite FORGET, CREATE, and my metacompiler's version of CREATE but it was worth it! Faster dictionary searches without any hash table or hash function.
I also wrote a word for use in WORD to speed things up. >HERE is a code word that takes an address and a count. It copies the string to HERE as a counted string with a trailing blank as per Forth-83.
Cheers,
Jim
Code:
This:
CONTEXT @ @ (FIND)
Became this:
CONTEXT @ (FIND)
And this:
CURRENT @ @ (FIND)
Became this:
CURRENT @ (FIND)
Re: dictionary hashing
Brad R wrote:
As a rule, one should not FORGET words which are in the kernel.
Many Forths deliberately block this action. The ANS Forth MARKER word avoids this problem.
Re: dictionary hashing
In PETTIL I separated the header from the code entirely, putting names and CFAs in a symbol table. I also divided the dictionary in half with anything that touches the symbol table in the upper half (interpret, editor, assembler, compiler...). The lower half contains a 'core' cadre of words, and any user code that gets compiled. So when all the coding is said and done, the symbol table and developer environment can go away and free up 10K or so.
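The discardable-symbol-table arrangement can be modeled like this (a hypothetical Python sketch; the names and addresses are invented, and real PETTIL works on raw C64 memory, not Python dicts):

```python
# Toy model of separating headers from code: names live only in a
# symbol table mapping name -> code address. Once everything is
# compiled, the table (and the tools that use it) can be discarded
# while the compiled code keeps running.

code_space = {}          # address -> behavior (the part that stays resident)
symbols = {}             # name -> address (the discardable part)

def define(name, addr, behavior):
    code_space[addr] = behavior
    symbols[name] = addr

define("HELLO", 0x1000, lambda: "hello")
compiled = code_space[symbols["HELLO"]]      # references resolved at compile time

symbols.clear()                              # throw the symbol table away
print(compiled())        # hello -- still runs without any headers
```

Because compiled code holds addresses rather than names, nothing at run time ever consults the symbol table, which is what makes reclaiming that 10K safe.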
Re: dictionary hashing
I haven't looked at how F83 FORGETs words. It hashed the words across 4 different paths, but each of those was, effectively, just an independent dictionary, so the word headers were not changed at all to accommodate this technique.
Simply, you'd hash the word's name, decide which of the 4 lanes to pursue, and then search that lane to find your word.
Similarly, when a word is added, it picks the lane, and just appends the word to that lane.
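The lane scheme is easy to sketch (a hypothetical Python model; F83's actual hash function differs, and its lanes are linked word headers rather than lists):

```python
# F83-style multi-lane dictionary: a cheap hash of the name picks one
# of four independent lanes, and only that lane is searched or appended
# to. Headers themselves need no change -- each lane is an ordinary
# newest-first dictionary chain.

lanes = [[] for _ in range(4)]

def lane_of(name):
    return (len(name) + ord(name[0])) % 4     # any cheap hash will do

def add_word(name):
    lanes[lane_of(name)].insert(0, name)      # newest first in its lane

def find(name):
    lane = lanes[lane_of(name)]
    return name if name in lane else None

for w in ["DUP", "SWAP", "DROP", "OVER"]:
    add_word(w)
```

On average each search touches a quarter of the dictionary, and since redefinitions land at the head of the same lane as the word they shadow, the newest-definition-first rule still holds within every lane.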