6502.org • View topic - Search Engine in Forth

View unanswered posts | View active topics

Board index » 6502.org Users Forum » Forth

All times are UTC

Search Engine in Forth

Page 1 of 1

[ 9 posts ]

Previous topic | Next topic

Author

Message

IamRob

Post subject: Search Engine in Forth

Posted: Fri Mar 12, 2021 8:09 pm

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357

In an earlier thread I created a HELP search engine that returned information about a word delimited by a colon.

I tried applying the same search engine to finding any word within a number of screens. But it is a lot slower than I would like it to be. The problem is that using the space as a delimiter, then every word has to be checked for a match.

In the HELP engine, a match is only checked when it comes across a colon. A lot fewer matches are made and the search is quite a bit faster.

Even though a linear search is done, and a match is checked on every word, I still expected it to be faster that it is. The order of the search is this:

1) LOAD SCREEN
2) ENCLOSE each word delimited by a SPACE
3) use MATCH to check each word against the SEARCH word
4) print screen# and line# of each matched word found
5) if not end of screen, increment pointer and return to step #2
6) if not last screen, increment screen and return to step #1
7) end

I limited the search to 16 screens ( 0 SEARCH searches 00..$0F, $10 SEARCH searches $10..$1F and so on ) which is same as searching 16k in memory. At 1 Mhz, it takes 1 min 10 secs to search all 16 screens for each occurrence of the word.

I know I could make the search into a primitive, but I was just wondering if there might be a different way to search random words to make the search programmed entirely in Forth go faster. Reducing the number of times MATCH is used (like in HELP) helped speed things up. But doing a search on every word delimited by spaces is a lot slower. In HELP, a search for a colon is done before using ENCLOSE then entering MATCH.

Any ideas?

Top

SamCoVT

Post subject: Re: Search Engine in Forth

Posted: Fri Mar 12, 2021 9:03 pm

Joined: Sun May 13, 2018 5:49 pm
Posts: 247

How about a binary search? It would require your words to be in ASCIIbetical order. For a given range of screens, start on the "center" screen and read just the first word. Determine if the word you want is this word you are looking at, or if it should be on the right or left side of this word.

Once you've narrowed it down to a single screen, you can either do a linear traversal of the screen like you were doing before or you can continue binary searching within the screen (eg. go to the middle line) to narrow your search. While it would be nice to have all of the entries the same size to make it much faster to locate the start of a word, you could just start anywhere and move forward to find the delimiter (eg. : ) taking care to handle the "end of the screen" case. Then read the next word and check it.

With 8 screens, it would probably take no more than 10 checks (no more than 3 to determine the screen it should be on and probably no more than 7 more checks within that screen to locate (or determine it doesn't exist) a word with your data.

If you evenly spaced your words, it would take more space but be faster to calculate where a word started rather than hunting for a delimiter.

Top

IamRob

Post subject: Re: Search Engine in Forth

Posted: Sat Mar 13, 2021 5:04 am

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357

SamCoVT wrote:

I should have mentioned.

The screens contain word definitions and are not a list of words, so can't be alphabetized. Also, all occurrences of stated word are wanted to be listed, so have to scan all word definitions on all 16 screens.

One thought I came up with was to only search for the starting letter of the search word. There are a lot more spaces than there are words that start with the beginning letter. The problem comes when using the same code as the HELP engine. Using a colon greatly speeds up the HELP search as it scans for a colon before making a match. This eliminates checking all the info data of each word and only focuses on the primary word.

But the colon search is interfering with the word search as the pointer is incremented 1 position past the colon before entering the MATCH routine. When searching for a starting letter, the incremented position doesn't work and adds quite a bit of extra code to check for it. I was hoping to be able to use the search engine for both instances, but it looks like I may have to write a 2nd word definition for a separate search.

Top

SamCoVT

Post subject: Re: Search Engine in Forth

Posted: Wed Mar 17, 2021 6:33 pm

Joined: Sun May 13, 2018 5:49 pm
Posts: 247

Do all the definitions start at the left edge of the screen? If that's the case, you only need to check 16 times for a colon at the left edge, and can then check for the name only on lines that are actually definitions.

If, on the other hand, you're "squishing" as many definitions into a screen as you possibly can, then you're going to be stuck doing a linear search if the list is also unordered. I would at least consider alphabetizing the words within each screen to allow a faster search if you're going to be doing it often enough that you really need better performance.

One other thought is to modify your dictionary structure to include the screen a word came from in the header for each word. If you don't mind the extra memory use in the headers, you could include the line number as well. Words that did not come from a screen could use the value 0, as you can't LOAD from screen 0. This would make your searches the same speed as a dictionary lookup.

Top

JimBoyd

Post subject: Re: Search Engine in Forth

Posted: Wed Mar 17, 2021 8:10 pm

Joined: Fri May 05, 2017 9:27 pm
Posts: 855

SamCoVT wrote:

I got the impression IamRob was referring to searching the source to find each time a given word is used in the definition of other words.

Quote:

One other thought is to modify your dictionary structure to include the screen a word came from in the header for each word. If you don't mind the extra memory use in the headers, you could include the line number as well. Words that did not come from a screen could use the value 0, as you can't LOAD from screen 0. This would make your searches the same speed as a dictionary lookup.

Some Forth's have this feature. With my Forth the view field ( that extra cell in the header) can be included by setting a variable called VIEW to any non zero value and disabled by setting it to zero. If present, the view field is the first field of the header.
Typing LOCATE FOOBAR will list the screen where FOOBAR was loaded from.
None of the kernel words in Fleet Forth have a view field.

Top

IamRob

Post subject: Re: Search Engine in Forth

Posted: Thu Mar 18, 2021 4:10 am

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357

JimBoyd wrote:

SamCoVT wrote:

I got the impression IamRob was referring to searching the source to find each time a given word is used in the definition of other words.

That is correct. Not just the colon word gets checked, but also all definition words as well. The search comes in real handy when finding every instance of a variable or a word that needs renaming.

Quote:

Instead of each word having an extra byte in the header for the screen number, I put a list of all the colon words in line #0 of each screen, which can be seen with INDEX. It is quick and easy to list the Indexes of each screen to find a word definition. Ultimately if I just want to find a colon definition, I would just search line #0.

Top

JimBoyd

Post subject: Re: Search Engine in Forth

Posted: Fri Mar 19, 2021 11:54 pm

Joined: Fri May 05, 2017 9:27 pm
Posts: 855

IamRob wrote:

With the Forths that I am familiar with, LOADing a screen will cause that screen to be interpreted and or compiled. Do you mean read that screen into a buffer?

Fleet Forth's editor is based on the Forth Inc. line editor presented in Forth Dimensions volume 3 issue 3. It has a word S which takes a block number and parses text up to a delimiting ^ and searches all blocks from the current one to one less than the block number that was on the stack. I ran some tests by copying a range of sixteen blocks to blocks used by the C64 Ram Expansion Module to eliminate disk access time as a variable. Depending on the text in the screens and what was searched for, I got results ranging from 11.6 to 20.2 seconds. Some of that time is the time to TYPE the information and the time it takes the Commodore 64 to scroll.

Code:

 OK
CLEAR.TIME $10 FH S  LDA ^ ELAPSED 
   IP )Y LDA ^ PHA  IP INC                                        2 8001 
   IP )Y LDA ^                                                    4 8001 
   IP )Y LDA ^ W 1+ STA  DEY                                      E 8001 
   IP )Y LDA ^ W    STA  CLC                                      F 8001 
   IP LDA ^ 2 # ADC  IP STA                                       1 8002 
   IP )Y LDA ^ PHA  TYA                                           3 8003 
      0 ,X LDA ^ N ,Y STA                                         9 8003 
   0FE ,X LDA ^ 0FF ,X ORA                                        4 8005 
      IP LDA ^ 2 # ADC  IP STA                                    8 8005 
   IP )Y LDA ^ PHA  INY                                           2 8006 
   IP )Y LDA ^ IP 1+ STA                                          3 8006 
   IP )Y LDA ^ PHA  DEY                                           4 8007 
   IP )Y LDA ^ PHA  CLC                                           5 8007 
   3 ,X LDA ^ 80 # ADC  PHA                                       6 8007 
   3 ,X STA  2 ,X LDA ^ PHA  SEC                                  7 8007 
   0 ,X LDA ^ 2 ,X SBC  TAY                                       8 8007 
   1 ,X LDA ^ 3 ,X SBC  PHA                                       9 8007 
   0 ,X LDA ^ 2 ,X CMP                                            3 8008 
   1 ,X LDA ^ 3 ,X CMP                                            5 8008 
   0FE ,X LDA ^ 0FF ,X ORA                                        8 800A 
   0FE ,X LDA ^ 0FF ,X ORA                                        7 800B 
   0FE ,X LDA ^ 0FF ,X ORA                                        C 800B 
   0FF ,X LDA ^ PHA  0FE ,X LDA                                   4 800C 
   0FF ,X LDA  PHA  0FE ,X LDA ^                                  4 800C 
   2 ,X LDA ^ 2 # SBC  N 2+  STA                                  5 800D 
   3 ,X LDA ^ 0 # SBC  N 3 + STA                                  6 800D 
   0 ,X LDY  1 ,X LDA ^                                           7 800D 
         N )Y LDA ^ TAX  INY                                      E 800D 
         N )Y LDA ^                                               F 800D 
         N )Y LDA ^                                               4 800E 
            N 2+ )Y LDA ^                                         A 800E 
         0 # LDA ^                                                3 800F 
         N )Y LDA ^ IBIT # AND                                    6 800F 
            1 # LDA ^                                             9 800F 
         0FF # LDA ^ TAY                                          4 8010 
      N 4 + )Y LDA ^                                              8 8010 
      N 4 + )Y LDA ^                                              C 8010 
 0:00:20.2   OK
CLEAR.TIME $10 FH S  PHA ^ ELAPSED 
   IP )Y LDA  PHA ^ IP INC                                        2 8001 
   IP )Y LDA  PHA ^ TYA                                           3 8003 
   IP )Y LDA  PHA ^ INY                                           2 8006 
   IP )Y LDA  PHA ^ DEY                                           4 8007 
   IP )Y LDA  PHA ^ CLC                                           5 8007 
   3 ,X LDA  80 # ADC  PHA ^                                      6 8007 
   3 ,X STA  2 ,X LDA  PHA ^ SEC                                  7 8007 
   1 ,X LDA  3 ,X SBC  PHA ^                                      9 8007 
   TYA  PHA ^                                                     A 8007 
   0FF ,X LDA  PHA ^ 0FE ,X LDA                                   4 800C 
 0:00:13.4   OK
CLEAR.TIME $10 FH S  ADC ^ ELAPSED 
   IP LDA  2 # ADC ^ IP STA                                       1 8002 
      IP LDA  2 # ADC ^ IP STA                                    8 8005 
   3 ,X LDA  80 # ADC ^ PHA                                       6 8007 
   102 ,X ADC ^ 102 ,X STA                                        7 8009 
   101 ,X ADC ^ 101 ,X STA                                        6 800C 
         TYA  N ADC ^ 0 ,X STA                                    2 800F 
         N 1+ ADC ^ 1 ,X STA                                      4 800F 
 0:00:12.7   OK
78 1 RAM $10 BMOVE 
   87  8010
   86  800F
   85  800E
   84  800D
   83  800C
   82  800B
   81  800A
   80  8009
   7F  8008
   7E  8007
   7D  8006
   7C  8005
   7B  8004
   7A  8003
   79  8002
   78  8001 OK
CLEAR.TIME $10 FH S  DUP ^ ELAPSED 
   R> DUP ^2+ >R @ >BODY ! ;                                      3 8002 
   R> DUP ^COUNT + >R ;                                           9 8002 
   AP0 @ DUP ^! ;                                                 F 8002 
   DUP ^>IN @ UMIN                                                9 8005 
   DUP ^NEGATE  UNDER+ UNDER+ ;                                   A 8005 
   DUP ^0                                                         B 8006 
   DUP ^0<> + NEGATE ROT +                                        5 800A 
      DUP ^SCR ! CR ." SCR# " U.                                  8 800C 
 0:00:12.9   OK
CLEAR.TIME $10 FH S  DROP ^ ELAPSED 
   DROP ^PAUSE ;                                                  6 8001 
   DROP ^PAUSE ;                                                  C 8001 
      'STREAM DROP ^ >IN @ C/L /MOD                               9 800C 
 0:00:11.6   OK
CONSOLE 

The word FH ( from here ) when editing, will add the number in SCR to the value on the stack. When loading a block, it will add the value in BLK to the number on the stack.

The heart of the editor's search function is the word MATCH . It takes the address and count of a string to search ( in this case, the address of the block buffer and $400) and the address and count of the string to find ( in this case, the contents of the editors find buffer, FBUF , a 65 byte buffer). MATCH returns the offset to the character just past the search string in the string to search and a flag.

Top

IamRob

Post subject: Re: Search Engine in Forth

Posted: Sat Mar 20, 2021 6:07 am

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357

JimBoyd wrote:

The heart of the editor's search function is the word MATCH . It takes the address and count of a string to search ( in this case, the address of the block buffer and $400) and the address and count of the string to find ( in this case, the contents of the editors find buffer, FBUF , a 65 byte buffer). MATCH returns the offset to the character just past the search string in the string to search and a flag.

I see a couple of possible reasons you might be getting good speed results.
First you are searching for a very popular word "LDA ^" that is only 6-10 chars in on each line. If the rest of the line is bypassed after a MATCH has been made and it drops down to the next line, that could speed up the search significantly.

Also your lines have a lot of spaces at the midpoint of each line, which can also be bypassed or skimmed over very quickly. I made some adjustments taking these two ideas into consideration and was able to reduce the search by about 20 secs.

If I find I am using SEARCH quite a bit, I might make a primitive definition out of it yet.

Thanks

Top

JimBoyd

Post subject: Re: Search Engine in Forth

Posted: Sun Mar 21, 2021 9:05 pm

Joined: Fri May 05, 2017 9:27 pm
Posts: 855

No, Fleet Forth's editor word S does not skip white space nor does it skip to the next line when a match is found. Here is another example. I've copied some screens from Blazin' Forth's system loader to blocks in the C64's Ram Expansion Unit. I'm using screens from Blazin' Forth's system loader because the source in these screens is densely packed.

Code:

 OK
CLEAR.TIME $10 FH S THEN^ ELAPSED 
N 1+ SBC, 0= NOT IF, 0FF # LDA, N STA, THEN^, N LDA, D001 ,Y STA, 5 8002 
N 2+ STA, THEN^, N 2+ LDA, D000 ,Y STA, N 3 + LDA, 0= NOT         8 8002 
THEN^, D010 STA, XSAVE LDX, NEXT JMP, END-CODE                    A 8002 
     ELSE 2DUP < IF SWAP THEN^ R@ MOVES 2+ ! MOVES OFF R> THEN    6 8003 
     ELSE 2DUP < IF SWAP THEN R@ MOVES 2+ ! MOVES OFF R> THEN^    6 8003 
     DRAW-TURTLE #ON C@ 80 OR DUP D015 C! #ON ! TPOS THEN^ ;      B 8005 
     01 DC0D C! (SPLIT) 0314 ! 1 D01A C! SPLIT? ON THEN^ ;        3 8006 
  0 D01A C! EA31 0314 ! 81 DC0D C! GRON GBACK @ D020 C! THEN^ ;   6 8006 
  N ORA, N 1+ )Y STA, INY, 0= IF, N 2+ INC, DEX, THEN^,           5 8007 
   IF GBACK @ DUP 53280 C! 53281 C! THEN^                         E 8007 
   MULTI @ NOT IF HRBG THEN^ ;                                    F 8007 
            ELSE 0F AND DUP PEN ! D02E C! THEN^ ;                 C 8008 
   PENSTATE @ IF ASPECT DLINE ELSE 2DROP THEN^                    4 8009 
   TURTLESTATE @ IF TPOS THEN^ ;                                  5 8009 
    DUP 0< IF 360 + THEN^ 360 MOD HEADING !                       B 8009 
   TURTLESTATE @ IF DRAW-TURTLE THEN^ ;                           C 8009 
CODE (RY) ?Y JSR, CS IF, 0 # LDA, ELSE, 255 # LDA, THEN^,         3 800B 
CODE (TX) ?X JSR, CS IF, 0 # LDA, ELSE, 255 # LDA, THEN^,         5 800B 
: ?WINDOW IF ND TRUE ABORT" WINDOW OUT OF RANGE" THEN^ ;          7 800B 
   SPLIT? @ IF SP 0 19 CURSOR THEN^ TSET ;                        D 800B 
 D01C C@ [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^     9 800C 
   [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^ D01D C! ; 7 800E 
  [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^ D017 C! ;  B 800E 
  ELSE MASKS SPRITE# @ 7 XOR + C@ OR  THEN^ D01B C! ;             E 800E 
 0:00:16.4   OK
CONSOLE 

Notice this part of the system log here:

Code:

     ELSE 2DUP < IF SWAP THEN^ R@ MOVES 2+ ! MOVES OFF R> THEN    6 8003 
     ELSE 2DUP < IF SWAP THEN R@ MOVES 2+ ! MOVES OFF R> THEN^    6 8003 

where both occurrences of the string THEN on line 6 was found. Since it is searching for matching strings, and not matching Forth words, it also finds the search string THEN in Blazin' Forth's assembler's THEN, with a trailing comma.

Code:

N 1+ SBC, 0= NOT IF, 0FF # LDA, N STA, THEN^, N LDA, D001 ,Y STA, 5 8002

Note that the ^ (carat) character is not part of the search string. It is not in the source. It is used by S as a delimiter. It also appears in the display to show where the search string was found on the line. S only needs the delimiter if there are any trailing blanks in the search string or if there are words to be executed right after S
Here is a quick timing word I threw together to demonstrate:

Code:

: TIMEIT 
   CLEAR.TIME ' EXECUTE ELAPSED ;

In the Forth-83 Standard, ' (tick) is not immediate so it gets compiled.
Here is a portion of a system log using that word. I cut out most of it because it is 140 lines long.

Code:

 OK
#105 FH TIMEIT S THEN 
: IS   STATE @ IF COMPILE (IS) ELSE ' >BODY ! THEN^ ; IMMEDIATE   A 8003 
         .
         .
         .
   [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^ D01D C! ; 7 8065 
  [ MASKS 8 + ] LITERAL SPRITE# @ 7 XOR + C@ AND THEN^ D017 C! ;  B 8065 
  ELSE MASKS SPRITE# @ 7 XOR + C@ OR  THEN^ D01B C! ;             E 8065 
 0:01:41.1   OK

I mentioned that Fleet Forth's editor is based on the one from Forth Dimensions volume 3 issue 3. There are some differences that I mention here.

Top

Page 1 of 1

[ 9 posts ]

Board index » 6502.org Users Forum » Forth

All times are UTC

Who is online

Users browsing this forum: No registered users and 9 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum