6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sun Nov 24, 2024 4:14 am

All times are UTC




Post new topic Reply to topic  [ 9 posts ] 
Author Message
 Post subject: Search Engine in Forth
PostPosted: Fri Mar 12, 2021 8:09 pm 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
In an earlier thread I created a HELP search engine that returned information about a word delimited by a colon.

I tried applying the same search engine to finding any word within a number of screens. But it is a lot slower than I would like it to be. The problem is that using the space as a delimiter, then every word has to be checked for a match.

In the HELP engine, a match is only checked when it comes across a colon. A lot fewer matches are made and the search is quite a bit faster.

Even though a linear search is done, and a match is checked on every word, I still expected it to be faster that it is. The order of the search is this:

1) LOAD SCREEN
2) ENCLOSE each word delimited by a SPACE
3) use MATCH to check each word against the SEARCH word
4) print screen# and line# of each matched word found
5) if not end of screen, increment pointer and return to step #2
6) if not last screen, increment screen and return to step #1
7) end

I limited the search to 16 screens ( 0 SEARCH searches 00..$0F, $10 SEARCH searches $10..$1F and so on ) which is same as searching 16k in memory. At 1 Mhz, it takes 1 min 10 secs to search all 16 screens for each occurrence of the word.

I know I could make the search into a primitive, but I was just wondering if there might be a different way to search random words to make the search programmed entirely in Forth go faster. Reducing the number of times MATCH is used (like in HELP) helped speed things up. But doing a search on every word delimited by spaces is a lot slower. In HELP, a search for a colon is done before using ENCLOSE then entering MATCH.

Any ideas?


Top
 Profile  
Reply with quote  
PostPosted: Fri Mar 12, 2021 9:03 pm 
Offline

Joined: Sun May 13, 2018 5:49 pm
Posts: 255
How about a binary search? It would require your words to be in ASCIIbetical order. For a given range of screens, start on the "center" screen and read just the first word. Determine if the word you want is this word you are looking at, or if it should be on the right or left side of this word.

Once you've narrowed it down to a single screen, you can either do a linear traversal of the screen like you were doing before or you can continue binary searching within the screen (eg. go to the middle line) to narrow your search. While it would be nice to have all of the entries the same size to make it much faster to locate the start of a word, you could just start anywhere and move forward to find the delimiter (eg. : ) taking care to handle the "end of the screen" case. Then read the next word and check it.

With 8 screens, it would probably take no more than 10 checks (no more than 3 to determine the screen it should be on and probably no more than 7 more checks within that screen to locate (or determine it doesn't exist) a word with your data.

If you evenly spaced your words, it would take more space but be faster to calculate where a word started rather than hunting for a delimiter.


Top
 Profile  
Reply with quote  
PostPosted: Sat Mar 13, 2021 5:04 am 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
SamCoVT wrote:
How about a binary search? It would require your words to be in ASCIIbetical order. For a given range of screens, start on the "center" screen and read just the first word. Determine if the word you want is this word you are looking at, or if it should be on the right or left side of this word.

Once you've narrowed it down to a single screen, you can either do a linear traversal of the screen like you were doing before or you can continue binary searching within the screen (eg. go to the middle line) to narrow your search. While it would be nice to have all of the entries the same size to make it much faster to locate the start of a word, you could just start anywhere and move forward to find the delimiter (eg. : ) taking care to handle the "end of the screen" case. Then read the next word and check it.

With 8 screens, it would probably take no more than 10 checks (no more than 3 to determine the screen it should be on and probably no more than 7 more checks within that screen to locate (or determine it doesn't exist) a word with your data.

If you evenly spaced your words, it would take more space but be faster to calculate where a word started rather than hunting for a delimiter.
I should have mentioned.

The screens contain word definitions and are not a list of words, so can't be alphabetized. Also, all occurrences of stated word are wanted to be listed, so have to scan all word definitions on all 16 screens.

One thought I came up with was to only search for the starting letter of the search word. There are a lot more spaces than there are words that start with the beginning letter. The problem comes when using the same code as the HELP engine. Using a colon greatly speeds up the HELP search as it scans for a colon before making a match. This eliminates checking all the info data of each word and only focuses on the primary word.

But the colon search is interfering with the word search as the pointer is incremented 1 position past the colon before entering the MATCH routine. When searching for a starting letter, the incremented position doesn't work and adds quite a bit of extra code to check for it. I was hoping to be able to use the search engine for both instances, but it looks like I may have to write a 2nd word definition for a separate search.


Top
 Profile  
Reply with quote  
PostPosted: Wed Mar 17, 2021 6:33 pm 
Offline

Joined: Sun May 13, 2018 5:49 pm
Posts: 255
Do all the definitions start at the left edge of the screen? If that's the case, you only need to check 16 times for a colon at the left edge, and can then check for the name only on lines that are actually definitions.

If, on the other hand, you're "squishing" as many definitions into a screen as you possibly can, then you're going to be stuck doing a linear search if the list is also unordered. I would at least consider alphabetizing the words within each screen to allow a faster search if you're going to be doing it often enough that you really need better performance.

One other thought is to modify your dictionary structure to include the screen a word came from in the header for each word. If you don't mind the extra memory use in the headers, you could include the line number as well. Words that did not come from a screen could use the value 0, as you can't LOAD from screen 0. This would make your searches the same speed as a dictionary lookup.


Top
 Profile  
Reply with quote  
PostPosted: Wed Mar 17, 2021 8:10 pm 
Offline

Joined: Fri May 05, 2017 9:27 pm
Posts: 895
SamCoVT wrote:
Do all the definitions start at the left edge of the screen? If that's the case, you only need to check 16 times for a colon at the left edge, and can then check for the name only on lines that are actually definitions.

I got the impression IamRob was referring to searching the source to find each time a given word is used in the definition of other words.

Quote:
One other thought is to modify your dictionary structure to include the screen a word came from in the header for each word. If you don't mind the extra memory use in the headers, you could include the line number as well. Words that did not come from a screen could use the value 0, as you can't LOAD from screen 0. This would make your searches the same speed as a dictionary lookup.


Some Forth's have this feature. With my Forth the view field ( that extra cell in the header) can be included by setting a variable called VIEW to any non zero value and disabled by setting it to zero. If present, the view field is the first field of the header.
Typing LOCATE FOOBAR will list the screen where FOOBAR was loaded from.
None of the kernel words in Fleet Forth have a view field.


Top
 Profile  
Reply with quote  
PostPosted: Thu Mar 18, 2021 4:10 am 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
JimBoyd wrote:
SamCoVT wrote:
Do all the definitions start at the left edge of the screen? If that's the case, you only need to check 16 times for a colon at the left edge, and can then check for the name only on lines that are actually definitions.

I got the impression IamRob was referring to searching the source to find each time a given word is used in the definition of other words.

That is correct. Not just the colon word gets checked, but also all definition words as well. The search comes in real handy when finding every instance of a variable or a word that needs renaming.

Quote:

Quote:
One other thought is to modify your dictionary structure to include the screen a word came from in the header for each word. If you don't mind the extra memory use in the headers, you could include the line number as well. Words that did not come from a screen could use the value 0, as you can't LOAD from screen 0. This would make your searches the same speed as a dictionary lookup.


Some Forth's have this feature. With my Forth the view field ( that extra cell in the header) can be included by setting a variable called VIEW to any non zero value and disabled by setting it to zero. If present, the view field is the first field of the header.
Typing LOCATE FOOBAR will list the screen where FOOBAR was loaded from.
None of the kernel words in Fleet Forth have a view field.

Instead of each word having an extra byte in the header for the screen number, I put a list of all the colon words in line #0 of each screen, which can be seen with INDEX. It is quick and easy to list the Indexes of each screen to find a word definition. Ultimately if I just want to find a colon definition, I would just search line #0.


Top
 Profile  
Reply with quote  
PostPosted: Fri Mar 19, 2021 11:54 pm 
Offline

Joined: Fri May 05, 2017 9:27 pm
Posts: 895
IamRob wrote:
In an earlier thread I created a HELP search engine that returned information about a word delimited by a colon.

I tried applying the same search engine to finding any word within a number of screens. But it is a lot slower than I would like it to be. The problem is that using the space as a delimiter, then every word has to be checked for a match.

In the HELP engine, a match is only checked when it comes across a colon. A lot fewer matches are made and the search is quite a bit faster.

Even though a linear search is done, and a match is checked on every word, I still expected it to be faster that it is. The order of the search is this:

1) LOAD SCREEN
2) ENCLOSE each word delimited by a SPACE
3) use MATCH to check each word against the SEARCH word
4) print screen# and line# of each matched word found
5) if not end of screen, increment pointer and return to step #2
6) if not last screen, increment screen and return to step #1
7) end

I limited the search to 16 screens ( 0 SEARCH searches 00..$0F, $10 SEARCH searches $10..$1F and so on ) which is same as searching 16k in memory. At 1 Mhz, it takes 1 min 10 secs to search all 16 screens for each occurrence of the word.

I know I could make the search into a primitive, but I was just wondering if there might be a different way to search random words to make the search programmed entirely in Forth go faster. Reducing the number of times MATCH is used (like in HELP) helped speed things up. But doing a search on every word delimited by spaces is a lot slower. In HELP, a search for a colon is done before using ENCLOSE then entering MATCH.

Any ideas?


With the Forths that I am familiar with, LOADing a screen will cause that screen to be interpreted and or compiled. Do you mean read that screen into a buffer?

Fleet Forth's editor is based on the Forth Inc. line editor presented in Forth Dimensions volume 3 issue 3. It has a word S which takes a block number and parses text up to a delimiting ^ and searches all blocks from the current one to one less than the block number that was on the stack. I ran some tests by copying a range of sixteen blocks to blocks used by the C64 Ram Expansion Module to eliminate disk access time as a variable. Depending on the text in the screens and what was searched for, I got results ranging from 11.6 to 20.2 seconds. Some of that time is the time to TYPE the information and the time it takes the Commodore 64 to scroll.
Code:
 OK
CLEAR.TIME $10 FH S  LDA ^ ELAPSED
   IP )Y LDA ^ PHA  IP INC                                        2 8001
   IP )Y LDA ^                                                    4 8001
   IP )Y LDA ^ W 1+ STA  DEY                                      E 8001
   IP )Y LDA ^ W    STA  CLC                                      F 8001
   IP LDA ^ 2 # ADC  IP STA                                       1 8002
   IP )Y LDA ^ PHA  TYA                                           3 8003
      0 ,X LDA ^ N ,Y STA                                         9 8003
   0FE ,X LDA ^ 0FF ,X ORA                                        4 8005
      IP LDA ^ 2 # ADC  IP STA                                    8 8005
   IP )Y LDA ^ PHA  INY                                           2 8006
   IP )Y LDA ^ IP 1+ STA                                          3 8006
   IP )Y LDA ^ PHA  DEY                                           4 8007
   IP )Y LDA ^ PHA  CLC                                           5 8007
   3 ,X LDA ^ 80 # ADC  PHA                                       6 8007
   3 ,X STA  2 ,X LDA ^ PHA  SEC                                  7 8007
   0 ,X LDA ^ 2 ,X SBC  TAY                                       8 8007
   1 ,X LDA ^ 3 ,X SBC  PHA                                       9 8007
   0 ,X LDA ^ 2 ,X CMP                                            3 8008
   1 ,X LDA ^ 3 ,X CMP                                            5 8008
   0FE ,X LDA ^ 0FF ,X ORA                                        8 800A
   0FE ,X LDA ^ 0FF ,X ORA                                        7 800B
   0FE ,X LDA ^ 0FF ,X ORA                                        C 800B
   0FF ,X LDA ^ PHA  0FE ,X LDA                                   4 800C
   0FF ,X LDA  PHA  0FE ,X LDA ^                                  4 800C
   2 ,X LDA ^ 2 # SBC  N 2+  STA                                  5 800D
   3 ,X LDA ^ 0 # SBC  N 3 + STA                                  6 800D
   0 ,X LDY  1 ,X LDA ^                                           7 800D
         N )Y LDA ^ TAX  INY                                      E 800D
         N )Y LDA ^                                               F 800D
         N )Y LDA ^                                               4 800E
            N 2+ )Y LDA ^                                         A 800E
         0 # LDA ^                                                3 800F
         N )Y LDA ^ IBIT # AND                                    6 800F
            1 # LDA ^                                             9 800F
         0FF # LDA ^ TAY                                          4 8010
      N 4 + )Y LDA ^                                              8 8010
      N 4 + )Y LDA ^                                              C 8010
 0:00:20.2   OK
CLEAR.TIME $10 FH S  PHA ^ ELAPSED
   IP )Y LDA  PHA ^ IP INC                                        2 8001
   IP )Y LDA  PHA ^ TYA                                           3 8003
   IP )Y LDA  PHA ^ INY                                           2 8006
   IP )Y LDA  PHA ^ DEY                                           4 8007
   IP )Y LDA  PHA ^ CLC                                           5 8007
   3 ,X LDA  80 # ADC  PHA ^                                      6 8007
   3 ,X STA  2 ,X LDA  PHA ^ SEC                                  7 8007
   1 ,X LDA  3 ,X SBC  PHA ^                                      9 8007
   TYA  PHA ^                                                     A 8007
   0FF ,X LDA  PHA ^ 0FE ,X LDA                                   4 800C
 0:00:13.4   OK
CLEAR.TIME $10 FH S  ADC ^ ELAPSED
   IP LDA  2 # ADC ^ IP STA                                       1 8002
      IP LDA  2 # ADC ^ IP STA                                    8 8005
   3 ,X LDA  80 # ADC ^ PHA                                       6 8007
   102 ,X ADC ^ 102 ,X STA                                        7 8009
   101 ,X ADC ^ 101 ,X STA                                        6 800C
         TYA  N ADC ^ 0 ,X STA                                    2 800F
         N 1+ ADC ^ 1 ,X STA                                      4 800F
 0:00:12.7   OK
78 1 RAM $10 BMOVE
   87  8010
   86  800F
   85  800E
   84  800D
   83  800C
   82  800B
   81  800A
   80  8009
   7F  8008
   7E  8007
   7D  8006
   7C  8005
   7B  8004
   7A  8003
   79  8002
   78  8001 OK
CLEAR.TIME $10 FH S  DUP ^ ELAPSED
   R> DUP ^2+ >R @ >BODY ! ;                                      3 8002
   R> DUP ^COUNT + >R ;                                           9 8002
   AP0 @ DUP ^! ;                                                 F 8002
   DUP ^>IN @ UMIN                                                9 8005
   DUP ^NEGATE  UNDER+ UNDER+ ;                                   A 8005
   DUP ^0                                                         B 8006
   DUP ^0<> + NEGATE ROT +                                        5 800A
      DUP ^SCR ! CR ." SCR# " U.                                  8 800C
 0:00:12.9   OK
CLEAR.TIME $10 FH S  DROP ^ ELAPSED
   DROP ^PAUSE ;                                                  6 8001
   DROP ^PAUSE ;                                                  C 8001
      'STREAM DROP ^ >IN @ C/L /MOD                               9 800C
 0:00:11.6   OK
CONSOLE

The word FH ( from here ) when editing, will add the number in SCR to the value on the stack. When loading a block, it will add the value in BLK to the number on the stack.

The heart of the editor's search function is the word MATCH . It takes the address and count of a string to search ( in this case, the address of the block buffer and $400) and the address and count of the string to find ( in this case, the contents of the editors find buffer, FBUF , a 65 byte buffer). MATCH returns the offset to the character just past the search string in the string to search and a flag.


Top
 Profile  
Reply with quote  
PostPosted: Sat Mar 20, 2021 6:07 am 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
JimBoyd wrote:
The heart of the editor's search function is the word MATCH . It takes the address and count of a string to search ( in this case, the address of the block buffer and $400) and the address and count of the string to find ( in this case, the contents of the editors find buffer, FBUF , a 65 byte buffer). MATCH returns the offset to the character just past the search string in the string to search and a flag.

I see a couple of possible reasons you might be getting good speed results.
First you are searching for a very popular word "LDA ^" that is only 6-10 chars in on each line. If the rest of the line is bypassed after a MATCH has been made and it drops down to the next line, that could speed up the search significantly.

Also your lines have a lot of spaces at the midpoint of each line, which can also be bypassed or skimmed over very quickly. I made some adjustments taking these two ideas into consideration and was able to reduce the search by about 20 secs.

If I find I am using SEARCH quite a bit, I might make a primitive definition out of it yet.

Thanks


Top
 Profile  
Reply with quote  
PostPosted: Sun Mar 21, 2021 9:05 pm 
Offline

Joined: Fri May 05, 2017 9:27 pm
Posts: 895

No, Fleet Forth's editor word S does not skip white space nor does it skip to the next line when a match is found. Here is another example. I've copied some screens from Blazin' Forth's system loader to blocks in the C64's Ram Expansion Unit. I'm using screens from Blazin' Forth's system loader because the source in these screens is densely packed.
Code:
 OK
CLEAR.TIME $10 FH S THEN^ ELAPSED
N 1+ SBC, 0= NOT IF, 0FF # LDA, N STA, THEN^, N LDA, D001 ,Y STA, 5 8002
N 2+ STA, THEN^, N 2+ LDA, D000 ,Y STA, N 3 + LDA, 0= NOT         8 8002
THEN^, D010 STA, XSAVE LDX, NEXT JMP, END-CODE                    A 8002
     ELSE 2DUP < IF SWAP THEN^ R@ MOVES 2+ ! MOVES OFF R> THEN    6 8003
     ELSE 2DUP < IF SWAP THEN R@ MOVES 2+ ! MOVES OFF R> THEN^    6 8003
     DRAW-TURTLE #ON C@ 80 OR DUP D015 C! #ON ! TPOS THEN^ ;      B 8005
     01 DC0D C! (SPLIT) 0314 ! 1 D01A C! SPLIT? ON THEN^ ;        3 8006
  0 D01A C! EA31 0314 ! 81 DC0D C! GRON GBACK @ D020 C! THEN^ ;   6 8006
  N ORA, N 1+ )Y STA, INY, 0= IF, N 2+ INC, DEX, THEN^,           5 8007
   IF GBACK @ DUP 53280 C! 53281 C! THEN^                         E 8007
   MULTI @ NOT IF HRBG THEN^ ;                                    F 8007
            ELSE 0F AND DUP PEN ! D02E C! THEN^ ;                 C 8008
   PENSTATE @ IF ASPECT DLINE ELSE 2DROP THEN^                    4 8009
   TURTLESTATE @ IF TPOS THEN^ ;                                  5 8009
    DUP 0< IF 360 + THEN^ 360 MOD HEADING !                       B 8009
   TURTLESTATE @ IF DRAW-TURTLE THEN^ ;                           C 8009
CODE (RY) ?Y JSR, CS IF, 0 # LDA, ELSE, 255 # LDA, THEN^,         3 800B
CODE (TX) ?X JSR, CS IF, 0 # LDA, ELSE, 255 # LDA, THEN^,         5 800B
: ?WINDOW IF ND TRUE ABORT" WINDOW OUT OF RANGE" THEN^ ;          7 800B
   SPLIT? @ IF SP 0 19 CURSOR THEN^ TSET ;                        D 800B
 D01C C@ [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^     9 800C
   [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^ D01D C! ; 7 800E
  [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^ D017 C! ;  B 800E
  ELSE MASKS SPRITE# @ 7 XOR + C@ OR  THEN^ D01B C! ;             E 800E
 0:00:16.4   OK
CONSOLE

Notice this part of the system log here:
Code:
     ELSE 2DUP < IF SWAP THEN^ R@ MOVES 2+ ! MOVES OFF R> THEN    6 8003
     ELSE 2DUP < IF SWAP THEN R@ MOVES 2+ ! MOVES OFF R> THEN^    6 8003

where both occurrences of the string THEN on line 6 was found. Since it is searching for matching strings, and not matching Forth words, it also finds the search string THEN in Blazin' Forth's assembler's THEN, with a trailing comma.
Code:
N 1+ SBC, 0= NOT IF, 0FF # LDA, N STA, THEN^, N LDA, D001 ,Y STA, 5 8002

Note that the ^ (carat) character is not part of the search string. It is not in the source. It is used by S as a delimiter. It also appears in the display to show where the search string was found on the line. S only needs the delimiter if there are any trailing blanks in the search string or if there are words to be executed right after S
Here is a quick timing word I threw together to demonstrate:
Code:
: TIMEIT
   CLEAR.TIME ' EXECUTE ELAPSED ;

In the Forth-83 Standard, ' (tick) is not immediate so it gets compiled.
Here is a portion of a system log using that word. I cut out most of it because it is 140 lines long.
Code:
 OK
#105 FH TIMEIT S THEN
: IS   STATE @ IF COMPILE (IS) ELSE ' >BODY ! THEN^ ; IMMEDIATE   A 8003
         .
         .
         .
   [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^ D01D C! ; 7 8065
  [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^ D017 C! ;  B 8065
  ELSE MASKS SPRITE# @ 7 XOR + C@ OR  THEN^ D01B C! ;             E 8065
 0:01:41.1   OK

I mentioned that Fleet Forth's editor is based on the one from Forth Dimensions volume 3 issue 3. There are some differences that I mention here.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 9 posts ] 

All times are UTC


Who is online

Users browsing this forum: No registered users and 4 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: