dictionary hashing
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Hmmm... I think that would foul things up for FORGET or MARKER too. Still, let me know if I'm misunderstanding your idea.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
GARTHWILSON wrote:
Hmmm... I think that would foul things up for FORGET or MARKER too. Still, let me know if I'm misunderstanding your idea.
Anyway, for any given information service, ADD, EDIT, and DELETE all need to know the internal structure of the database they're adding to, editing, or removing from, and all need to know how to search for their data. Otherwise, they couldn't do their task. Hence, if you make changes to how things are added or edited, you can bet that you're going to need to adjust the deletion procedures too.
In other words, what I'm trying to say is, "expect it, and prepare for it."
I'm actually going through this right now with my own FTS/Forth (a native-booting Forth for x86 PCs!!), which is based on RetroForth 8 running under the L4 microkernel. I've had to rewrite pretty close to 66% of the code, because of all the dependencies it has on the vocabulary and on how it handled user input. (When you think about it, there are only three major components to any Forth system: keyboard input, which I'm going to rewrite shortly; video output, already rewritten; and the dictionary database, for which I've rewritten about half the code.) So yeah, I'm easily approaching a 66% rewrite of the code.
I'm going to end up rewriting everything again soon enough anyway, but right now I'm concentrating on getting as much of the original code running as possible. Believe it or not, I actually have about four people waiting for its completion, including (ironically) the author of RF8.
GARTHWILSON wrote:
(This would of course mean the headers all have to be in RAM, not ROM.)
Code:
ORG TOP_OF_RAM-NUMBER_OF_HEADS*4
LINK4: DW LINK3,D_HEAD
LINK3: DW LINK2,C_HEAD
LINK2: DW LINK1,B_HEAD
LINK1: DW END_OF_LINKS,A_HEAD
Code:
NIP_HEAD: DB 3,"NIP" ; name length, name
NIP: DW DO_COLON,SWAP,DROP,EXIT
Anyway, having the entire dictionary cached does have some interesting properties.
1. You don't have to decide how big to make the cache, or whether you would want to allow it to be resized.
2. FIND will be a smaller, simpler (but not necessarily faster) routine, since it only has to search the cache, not the cache then the dictionary.
3. FIND doesn't have to compare the same word twice. For example, if A is a recently defined word and is in the cache, and B is not recent and not in the cache, then when using a separate cache and searching for B, FIND will usually compare B to A twice, once when searching the cache and once when searching the dictionary.
4. Words that are used occasionally (i.e. they wouldn't usually be found in a separate cache), would not necessarily fall all the way to the end (i.e. last word found) of the dictionary, so you might be able to do without hashing. I should point out that the use of a cache does not depend on any particular dictionary implementation (hashing, linked lists, etc.).
So there are some trade-offs here, but it's an interesting approach.
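The whole-dictionary cache described above behaves like a move-to-front list: there is one search pass, nothing is compared twice, and words that get used at all drift away from the end. A toy Python model (hypothetical; not from any of the Forths discussed here) makes the idea concrete:

```python
# A toy model of a dictionary kept in most-recently-found order, so FIND
# is a single pass and recently used words sit near the front.
# (Illustrative sketch only; a real Forth would do this by relinking
# word headers, not by shuffling a Python list.)

class MRUDictionary:
    def __init__(self, words):
        self.words = list(words)        # index 0 is searched first

    def find(self, name):
        """Linear search; on a hit, promote the found word to the front."""
        for i, w in enumerate(self.words):
            if w == name:
                self.words.insert(0, self.words.pop(i))
                return w
        return None

d = MRUDictionary(["DUP", "SWAP", "DROP", "NIP"])
d.find("NIP")               # found last this time...
print(d.words[0])           # NIP -- ...but searched first from now on
```

Note that the promotion on every hit is exactly why point 4 holds: an occasionally used word is pulled forward each time it is found, so it never sinks all the way to the end between uses.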
Re: Perfect hash
TMorita wrote:
It's called a "perfect hash".
Last edited by asmlang_6 on Sun Aug 21, 2005 4:52 pm, edited 1 time in total.
Sam
---
"OK, let's see, A0 on the 6502 goes to the ROM. Now where was that reset vector?"
Re: Perfect hash
asmlang_6 wrote:
TMorita wrote:
It's called a "perfect hash".
The 6502 and 65816 can store a whole lot more than just 256 data items. The 6502 has a 64KB address space, and the 65816 can address 16MB.
*NO* processor, regardless of how wide its data paths, can arrange for a perfect hash with an arbitrary set of keys. Even MD5 has recently had collisions.
The name "Perfect Hash" stems from the fact that, given n keys, you can guarantee that each key will hash into buckets 0..n-1, without any collisions. But there are many restrictions: see http://burtleburtle.net/bob/hash/perfect.html
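The idea can be sketched with a brute-force search: given a fixed key set, try hash parameters until every key lands in a distinct slot. (Hypothetical illustration only: the keyword list and the cheap hash family below are invented here, and the table is allowed to be larger than n, so this is a perfect but not necessarily minimal hash. Real generators, like the one at the URL above, are far more sophisticated.)

```python
# Brute-force a collision-free ("perfect") hash for a small fixed
# keyword set. Hash family and keywords are illustrative assumptions.

KEYWORDS = ["DUP", "SWAP", "DROP", "OVER", "NIP"]

def hash_of(word, m, size):
    # cheap hash: multiplier * (first char + length) + last char
    return (m * (ord(word[0]) + len(word)) + ord(word[-1])) % size

def build_perfect_hash(words, max_size=64, max_mult=64):
    """Search for (size, m) such that every word gets its own slot."""
    for size in range(len(words), max_size):
        for m in range(1, max_mult):
            slots = {hash_of(w, m, size) for w in words}
            if len(slots) == len(words):      # no collisions at all
                return size, m
    return None

size, m = build_perfect_hash(KEYWORDS)
table = [None] * size
for w in KEYWORDS:
    table[hash_of(w, m, size)] = w
# Lookup now needs exactly one string comparison:
assert table[hash_of("SWAP", m, size)] == "SWAP"
```

This also shows why the technique only works for a statically known key set: the exhaustive search is done once, offline, and adding a new key generally invalidates the parameters.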
Re: Perfect hash
asmlang_6 wrote:
TMorita wrote:
It's called a "perfect hash".
Perfect hashes are used for things like decoding statically-defined keywords, so all the keywords will hash to a different value and you only need to perform one string check.
It's not to be used for general-purpose string matching.
Toshi
I don't know if anyone's following chitselb's adventures in comp.lang.forth
but I thought I'd point out this thread:
"What are my options for dictionary hashing?"
There's some interesting stuff there; unfortunately there's also a lot of
rubbish to wade through.
Looking at the statistics there, I think I'd try finding some very simple
(i.e. fast) hash that put the most frequently used words in separate buckets
(while retaining a reasonable distribution for everything else), then
promote the most recently used word to the second spot in the list (leaving
the predetermined most-used word in first place).
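That promotion rule could be modeled like this (a hypothetical Python sketch; the bucket contents and the choice of which word is "most frequent" are invented for illustration):

```python
# Each bucket's slot 0 holds a predetermined most-frequently-used word
# that is never displaced; a word found anywhere deeper is promoted
# only to slot 1 (the "second spot").

def find(bucket, name):
    for i, w in enumerate(bucket):
        if w == name:
            if i > 1:                       # leave the fixed slot 0 alone
                bucket.insert(1, bucket.pop(i))
            return w
    return None

bucket = ["DUP", "SWAP", "OVER", "ROT", "NIP"]   # slot 0 = most frequent
find(bucket, "NIP")
print(bucket)    # ['DUP', 'NIP', 'SWAP', 'OVER', 'ROT']
```

The appeal is that a burst of uses of one rare word can't evict the statistically best word from the head of the chain, while still giving recently used words a near-front position.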
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
bogax, thanks for the alert. I've been reading for a long time and I'm still only halfway through it. As for the "rubbish," yes, there are a couple of particularly abrasive people on there dukin' it out. It's rather distasteful, but I can't say I haven't learned anything from it. There's a ton of material there (taking several pages), and Elizabeth Rather herself even posted there. It seems to me that we made more-efficient progress here.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Wow, I didn't know this thread was here. Guess I have some catching up to do. Thanks for waking it up after 5 years, bogax! I admit it, I was about as thrilled to see Elizabeth Rather posting as I would have been to have Grace Hopper check out my COBOL program or to have Phil Zimmermann sign my PGP key. Oh wait, he did that.
I'd like to get your feedback on what I'm doing, to wit:
* make a lot of internals words headerless since users won't need them
* standard linked list searching/chaining in child vocabularies...
* until getting to the top of the root (FORTH) vocabulary
* 4-bit Pearson hash indexing into...
* sixteen uniformly-distributed threads...
* sorted in ascending order by name length...
* with a Bloom filter to prevent us from trying to find most numbers
It could always be "more optimal," but it's already going to blow the doors off the single linked list (which is also the worst case for all numeric literals) that fig/Blazin'/et al. use.
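For the curious, here's a rough Python model of the scheme in that list (not PETTIL's actual code: the Pearson permutation table, the Bloom filter size, and its probe functions are all arbitrary illustrative choices):

```python
# Sketch: a 4-bit Pearson hash selects one of sixteen threads, and a
# small Bloom filter rejects most strings (e.g. numeric literals) that
# are not in the dictionary at all, before any thread is searched.
import random

random.seed(1)
T = list(range(16)); random.shuffle(T)        # 4-bit Pearson table (a permutation)

def pearson4(name):
    h = 0
    for ch in name.upper():
        h = T[(h ^ ord(ch)) & 0x0F]
    return h                                  # 0..15: which thread to search

threads = [[] for _ in range(16)]
bloom = 0                                     # 32-bit Bloom filter

def probes(name):
    # Two illustrative probe positions per name (Python's built-in hash
    # stands in for whatever cheap hashes a real system would use).
    return (hash(name) % 32, hash(name[::-1]) % 32)

def add_word(name):
    global bloom
    threads[pearson4(name)].insert(0, name)   # newest first in its thread
    for k in probes(name):
        bloom |= 1 << k

def find(name):
    # Cheap Bloom check first: if any probe bit is clear, the word is
    # definitely absent, so we skip the thread search entirely.
    for k in probes(name):
        if not bloom & (1 << k):
            return None
    return name if name in threads[pearson4(name)] else None

for w in ["DUP", "SWAP", "DROP", "1234"]:
    add_word(w)
```

The Bloom filter can give false positives (a string passes the bit test but isn't present), which is why the thread search still runs as the final check; it never gives false negatives, so defined words are always found.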
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: dictionary hashing
Another bump ...
I have noticed that the different Forths out and about have quite different dictionary orders. Some seem to group primitives together, then secondaries. Others seem to group by function, and others seem to group by popularity. Do any of the more-experienced users here have a preference for a simple linear-search (FIND) implementation? I realize that certain words like FORGET can have much-different global effects based on the dictionary orders, right? Or is the 'base' vocabulary usually FENCEd-off out of harm's way?
I'm sorry if these questions have painfully-obvious answers. I'm a Forth n00b, but I couldn't decide whether to post here or in 'Newbies'.
Mike
Re: dictionary hashing
I also like a simple linear-list implementation. But then I work in embedded systems; execution speed is important, but compilation speed is not.
I believe some people have gone to the trouble of analyzing which kernel words are most often used in compilation, and placing them later in the dictionary, so they're nearer the head of the search chain and found quickly. (I vaguely recall that New Micros' 68HC11 Max-Forth did this.)
IIRC, the only requirement ANS Forth places on dictionary order is that if you redefine a word, the newest definition must appear when you search the dictionary for that word.
As a rule, one should not FORGET words which are in the kernel.
Many Forths deliberately block this action. The ANS Forth MARKER word avoids this problem.
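MARKER's behavior can be modeled in a few lines of Python. (A loose sketch only: a real MARKER defines a named word and removes that word itself along with everything after it; this model just snapshots and truncates a list standing in for the dictionary.)

```python
# Toy model of MARKER semantics: record the current dictionary extent,
# and executing the marker forgets everything defined after it, so the
# kernel words can never be lost the way a careless FORGET could lose them.

dictionary = ["DUP", "SWAP", "DROP"]          # stand-ins for kernel words

def marker():
    saved = len(dictionary)                   # snapshot the current extent
    def restore():
        del dictionary[saved:]                # forget everything newer
    return restore

checkpoint = marker()
dictionary += ["FOO", "BAR"]                  # later user definitions
checkpoint()
print(dictionary)      # ['DUP', 'SWAP', 'DROP'] -- kernel untouched
```

The key safety property is visible here: the marker can only roll back to a point that existed when it was created, so words defined before it (the kernel) are structurally out of reach.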
Because there are never enough Forth implementations: http://www.camelforth.com
Re: dictionary hashing
Has anyone tried tinkering with the header structure?
My Forth for the C64 has the following header structure: link field, name field, code field, parameter field.
The link field of one word would point to the name field of the previous word in the dictionary. When I changed the system so the link field of a word pointed to the link field of the previous word, I was able to rewrite (FIND), the find primitive, so that it took about two-thirds the time, and it was also smaller. (FIND) also now took the address of a vocabulary instead of the address of the last word's name field in that vocabulary.
I also had to rewrite FORGET, CREATE, and my metacompiler's version of CREATE but it was worth it! Faster dictionary searches without any hash table or hash function.
I also wrote a word for use in WORD to speed things up. >HERE is a code word that takes an address and a count. It copies the string to HERE as a counted string with a trailing blank as per Forth-83.
Cheers,
Jim
Code:
This:
CONTEXT @ @ (FIND)
Became this:
CONTEXT @ (FIND)
And this:
CURRENT @ @ (FIND)
Became this:
CURRENT @ (FIND)
Re: dictionary hashing
Brad R wrote:
As a rule, one should not FORGET words which are in the kernel.
Many Forths deliberately block this action. The ANS Forth MARKER word avoids this problem.
Re: dictionary hashing
In PETTIL I separated the header from the code entirely, putting names and CFAs in a symbol table. I also divided the dictionary in half with anything that touches the symbol table in the upper half (interpret, editor, assembler, compiler...). The lower half contains a 'core' cadre of words, and any user code that gets compiled. So when all the coding is said and done, the symbol table and developer environment can go away and free up 10K or so.
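The discardable-symbol-table arrangement can be modeled like this (a hypothetical Python sketch; the names and addresses are invented, and real PETTIL works on raw C64 memory, not Python dicts):

```python
# Toy model of separating headers from code: names live only in a
# symbol table mapping name -> code address. Once everything is
# compiled, the table (and the tools that use it) can be discarded
# while the compiled code keeps running.

code_space = {}          # address -> behavior (the part that stays resident)
symbols = {}             # name -> address (the discardable part)

def define(name, addr, behavior):
    code_space[addr] = behavior
    symbols[name] = addr

define("HELLO", 0x1000, lambda: "hello")
compiled = code_space[symbols["HELLO"]]      # references resolved at compile time

symbols.clear()                              # throw the symbol table away
print(compiled())        # hello -- still runs without any headers
```

Because compiled code holds addresses rather than names, nothing at run time ever consults the symbol table, which is what makes reclaiming that 10K safe.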
Re: dictionary hashing
I haven't looked at how F83 FORGETs words. It hashed the words across 4 different paths, but each of those was, effectively, just an independent dictionary, so the word headers were not changed at all to accommodate this technique.
Simply, you'd hash the word's name, decide which of the 4 lanes to pursue, and then search that lane to find your word.
Similarly, when a word is added, it picks the lane, and just appends the word to that lane.
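The lane scheme is easy to sketch (a hypothetical Python model; F83's actual hash function differs, and its lanes are linked word headers rather than lists):

```python
# F83-style multi-lane dictionary: a cheap hash of the name picks one
# of four independent lanes, and only that lane is searched or appended
# to. Headers themselves need no change -- each lane is an ordinary
# newest-first dictionary chain.

lanes = [[] for _ in range(4)]

def lane_of(name):
    return (len(name) + ord(name[0])) % 4     # any cheap hash will do

def add_word(name):
    lanes[lane_of(name)].insert(0, name)      # newest first in its lane

def find(name):
    lane = lanes[lane_of(name)]
    return name if name in lane else None

for w in ["DUP", "SWAP", "DROP", "OVER"]:
    add_word(w)
```

On average each search touches a quarter of the dictionary, and since redefinitions land at the head of the same lane as the word they shadow, the newest-definition-first rule still holds within every lane.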