6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Mar 29, 2024 9:33 am

All times are UTC




Post new topic Reply to topic  [ 12 posts ] 
Author Message
PostPosted: Mon Jan 09, 2012 1:46 am 
Offline

Joined: Mon Jan 09, 2012 1:07 am
Posts: 5
I've never owned an Apple or coded 6502 assembly, so this is just a crazy obsession to figure out a puzzle. Out of pure curiosity, I found myself trying to understand how Woz implemented syntax rules in his integer basic. Here is the disassembled source I'm using.

http://www.easy68k.com/paulrsm/6502/INTLST.TXT

Would a kindly soul be willing to explain how the syntax rules in the syntax table are represented?

For what it's worth -- I've come to this approximate understanding of the structure of the parser/interpreter.

The parser uses a set of syntax diagrams to convert each line into a set of tokens/operators. The interpreter then executes the tokens using a pair of stacks and a precedence-operator style sequence of operations.

There is a subroutine for each operator that the interpreter uses. The addresses are available in VERBADRL (line 19960) and VERBADRH (line 20300) in the source above. The addresses are indexed by their token value -- so token $00 is the BEGIN_LINE operator, and so on.

The 2 precedence weights for each operator are also available from $e980 (line 19760 in the source above) as an interleaved pair of 4-bit values -- 1 byte per operator, and indexed by the operator's token value. So far so good.

The syntax rules should be present in SYNTABL and SYNTABL2. I can even roughly understand how the keywords are organized -- they are backwards, have $20 subtracted from them, and end either if the value exceeds $bf, or is less than $80.

Here's where i'm stuck -- I don't understand how the syntax rules are represented. I'll pick an example statement like

HLINE x1,x2 at y

The corresponding portion of the syntax table that seems most promising looks like this.

Code:
 22500   DB $67,"T"-32,"A"-32 ;AT
 22510   DB $07,","-32 ;
 22520   DB $07,"N"-32,"I"-32,"L"-32,"H"-32 ;HLIN


I'll reverse this so it's easier to see the underlying values that are puzzling me.

Code:
"HLIN" $07 "," $07 "AT" $67


So the $07 and $67 clearly denote elements of the intermediate parse diagram, but I don't understand what these numbers map to. $07 for instance, can't map to token $07 (the RUN n operator) shouldn't it map to an expression?

Any guidance would be much appreciated!
-kb


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Mon Jan 09, 2012 3:27 pm 
Offline

Joined: Fri Jun 27, 2003 8:12 am
Posts: 618
Location: Meadowbrook
If it helps, why not ask him on facebook or woz.org?

_________________
"My biggest dream in life? Building black plywood Habitrails"


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Mon Jan 09, 2012 3:54 pm 
Offline

Joined: Wed May 20, 2009 1:06 pm
Posts: 491
I happened to be searching on Amazon and ran into the book:

"Apple I Replica Creation: Back to the Garage" by Tom Owad (Author), Steve Wozniak (Author).

http://www.amazon.com/Apple-Replica-Cre ... =8-2-fkmr0

I'm sure it would answer a lot of hardware questions.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Mon Jan 09, 2012 4:03 pm 
Offline

Joined: Fri Jun 27, 2003 8:12 am
Posts: 618
Location: Meadowbrook
Also, Vince Briel's Apple 1 replica project can also shed some light, methinks.

http://www.applefritter.com/briel

_________________
"My biggest dream in life? Building black plywood Habitrails"


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 09, 2012 6:28 pm 
Offline

Joined: Thu Mar 03, 2011 5:56 pm
Posts: 277
I have a hunch that the disassembly is (slightly) incorrect, and thus misleading.

If you look at the table entries for GR and TEXT, it appears that all but the final character is encoded by subtracting $20 from the character, while the last character has $20 added. Alternatively, you could say that all characters are encoded by subtracting $20, and the final character is OR'ed with a flag value.

Then, if you look at the encoding of the expressions in the syntax table for HLIN and VLIN, you see that the first two (integer) expressions are encoded as $07, while the final is encoded as $67. Again, this looks like a token value combined with a flag (in this case, $60).

If you look at some of the other entries, they appear to be incomplete, and also interleaved with entries that are not decoded. For example, SCRN(a,b) - my guess is that the correct disassembly for this would be

")" + 32, $07, "," - 32, $07, "(" - 32, "N" - 32, "R" - 32, "C" - 32, "S" - 32

--- unfortunately, that leaves $49 as the final byte of this sequence instead of the $71 that is there. However, that is assuming that the flags used to encode the end of an entry is always $20, which some of the other entries contradict.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Tue Jan 10, 2012 1:25 am 
Offline

Joined: Mon Jan 09, 2012 1:07 am
Posts: 5
Great pointers to the sites, I'll dig around for any good stuff there. I didn't think about going back to the Woz himself -- why not!

@rwiker
The hint about the flags seems spot on -- thank you for pointing it out. Do you think the rule termination flag for a syntax-pointer-byte (for lack of a better word) might be the 6th bit ($40)?

It also looks like the last 5 bits index into a secondary 32-byte table (line 30960), whose values are presumably some way to continue parsing. I naively assumed they were offsets into the 512 byte SYNTABL structure, but I haven't been able to map the offsets in that table correctly. The access seems to happen roughly around here:

Code:
  9500    STY TOKNDXSTK,X
  9510    AND #$1F
  9520    TAY
  9530    LDA SYNTABLNDX,Y
  9540  HE491
  9550    ASL ;dbl
  9560    TAY
...


I thought this meant the good stuff would appear at
Code:
SYNTABL + 2*SYNTABLNDX[ syntax_pointer_byte & 0x1f ] + 2

as there are a couple of INYs down below, but I can't quite seem to convince myself that this is correct.

Much appreciate all the advice! Please do continue to pass along any tidbits about the details of this data structure; it seems remarkably compact and clever -- I suppose like all of Woz's designs.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Wed Jan 11, 2012 2:54 am 
Offline
User avatar

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
The syntax table is in reverse order, as you've already noticed. An underscore is stored at the end of the line by the line input routine.

There are 32 categories (well, that's what I'm going to call them). Each category contains one or more rules.

The syntax tokens are:

  • $00: return with syntax error
  • $01: return with no error
  • $02: parse a comment character (any non-control character, stops at the end of the line)
  • $03: parse a string literal character (stops at a closing quote)
  • $04-$1F: categories $04-$1F, next token is required
  • $20-$3F: categories $00-$1F, next token (and beyond) is optional
  • $40-$5F: categories $00-$1F, end of category
  • $60-$7D: categories $00-$1D, end of rule
  • $7E: branch "backward" 2 bytes
  • $7F: branch "backward" 1 byte
  • $80: parse letter A-Z, next token is required
  • $81: parse number 0-9, next token is required
  • $82-$BF: parse character "-_ -- next token is required
  • $C0: parse letter A-Z, end of rule
  • $C1: parse number 0-9, end of rule
  • $C2-$FF: parse character "-_ -- end of rule


The category table (SYNTABLNDX in the linked listing) is at $E197. To address 512 bytes with 8 bits, each category starts at an odd address. Each entry in the category table is:

Code:
DB (ADDRESS_OF_START_OF_CATEGORY-1)/2


Category $07 is a numeric expression. So HLIN $07 AT $07 , $67 means HLIN <num_expr> AT <num_expr> , <num_expr> and the $67 indicates the end of the HLIN rule.

Category $00 is an unsigned number (line number), so LIST $20 , $60 (at $EC61) means the LIST <line_number> , <line_number> with the comma and the second line number optional

Category $19 is a numberic variable name, so NEXT $39 , $7E means NEXT <variable> , <variable> , ... i.e. NEXT A or NEXT A,B or NEXT A,B,C etc.

Sometimes keywords are duplicated to force a mismatch (and thus force the end of a category). For example, RNDX (at $ED69) can never be matched since it comes after RND (at $ED75) in the numeric function category (at $ED79). (The first three letters would match RND first.) Thus, RNDX matches nothing and effectively marks the end of the category.

If you're curious about the THEN TO STEP etc. keywords at $EC17, see this thread:

viewtopic.php?t=726

If you have further questions, feel free to ask.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Thu Jan 12, 2012 2:19 am 
Offline

Joined: Mon Jan 09, 2012 1:07 am
Posts: 5
Aha -- thank you for that terrific explanation, it makes everything much clearer. The hint about the conditions for terminating variable names was a good one, I'd never have figured it out. I guess discarding spaces in a language leads to sticky situations!

I take it that the parser maintains a stack of rules for a given category, backtracking to each rule until the first match occurs (or a syntax error if it runs out of rules?)

I was still somewhat perplexed around the rules surrounding variable parsing, ie, category $19. I'll format each category as I thought I saw it -- one line per rule. First, the single rule in the $19 category:

Code:
[A-Z] [$1c]


So it's alphabet followed by category $1c, and $1c has four rules like:
Code:
[$1d] [go back 1]
+ [$00]
- [$00]
[$00]


category $00 matches 1 or more digits, so I'll skip that and look at category $1d, which indeed contains the terminating rules as you'd pointed out:
Code:
THEN
TO
....
AT
[A-Z]
[$0f]


and finally category $0f maps to a single digit in a slightly tricky way because the second rule won't kick in.

Code:
[0-9]
[0-9] ....


Going back up these diagrams, I wasn't seeing a way to match a single alphabet variable name. Maybe I made a mistake in these diagrams? Or does the "go back 1 byte" branch also implicitly make the previous token an optional one?

Thanks!


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Thu Jan 12, 2012 3:39 am 
Offline
User avatar

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
kbs wrote:
The hint about the conditions for terminating variable names was a good one, I'd never have figured it out.


That was the one that took me the longest to figure out.

kbs wrote:
I take it that the parser maintains a stack of rules for a given category, backtracking to each rule until the first match occurs (or a syntax error if it runs out of rules?)


It stores a pointer to the category on the zp "stack". It just goes through the rules in a category one by one in order until it reaches a match. If none match, pop the previous category pointer, and try the the next rule in the popped category.

kbs wrote:
Code:
[$1d] [go back 1]
+ [$00]
- [$00]
[$00]



I'll have to dig up my disassembly listing (which is more detailed for the syntax table), since it's been a while since I worked through this, but I'm pretty sure those rules are for INPUT. (When INPUT is used with a numeric variable e.g. INPUT A, it calls the parser to turn the number into an integer. See the code at $EBAA.)

Quote:
Or does the "go back 1 byte" branch also implicitly make the previous token an optional one?


$7E/$7F don't control whether something is optional. Optional only depends on whether the preceding token is $20-$3F. (The AND #$E0 CMP #$20 near $E448 is where the decision is made, I think.) If I recall correctly, there's a sneaky use of $20 as padding (before a category) somewhere, to force optional behavior. I don't remember where it is, offhand.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Thu Jan 12, 2012 5:46 pm 
Offline

Joined: Mon Jan 09, 2012 1:07 am
Posts: 5
Thank you for taking the time to post the explanations, it's neat to understand how the pieces start to fit together. If you do find your annotated disassembly, do you think you might be able to scan or post it online? I'm sure it would make for a very interesting read.

The note about how the optional flag actually works (by examining the preceding byte at a token) was very useful, and a crafty trick. And ha! The byte preceding the first rule in the [$1c] category is in fact $20, so it's starting to look a little less mysterious and the one-character variable name parsing at least, is making more sense.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Sun Jan 15, 2012 9:30 pm 
Offline
User avatar

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
A few errors I made above:

  • $01 (NOT $03) is parse a string literal character
  • $03 (NOT $01) is return with no error
  • $F197 (NOT $E197) is the address of category index table


There's a partial disassembly of Integer BASIC -- the parser, the syntax table, and the precedence table (the precedence table is not all that interesting in and of itself, but I've included it because it provides some context for common verb names, like comma) -- here:

http://www.lowkey.comuf.com/

It's meant for information purposes not reassembling; if you were to try assembling it, there's probably going to be a number of unresolved labels (though all should be of the L<hex_address> variety). I've also included a syntax table chart that I generated back when I was trying to figure out how the parser worked.

(It looks like you may have miscalculated a start-of-category address in your earlier post.)

By the way, this post also briefly mentions the Integer BASIC parser:

viewtopic.php?p=3282#3282

Also, Woz wrote an article that appeared in the May 1977 issue of Byte magazine (I've seen copies on the web) which also describes a bit of the design of Integer BASIC.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Tue Jan 17, 2012 11:29 pm 
Offline

Joined: Mon Jan 09, 2012 1:07 am
Posts: 5
Nice! Thank you for posting your notes -- so far, it seems to answer pretty much everything I was banging my head against :-)

The Woz article was indeed what got me curious about all this, it's a fun read. I'll post a link here for anyone who looks at this thread in the future.

http://www.downloads.reactivemicro.com/ ... iption.pdf

He talks briefly about the precedece-style execution (ie: how to interpret the tokenized bytes) but not much about the parser itself. I thought it was an interesting tradeoff to do the precedence computation at execution time rather than converting operators to postfix during parsing. It was something you'd in fact pointed out much more clearly (along with much else) in one of your posts in viewtopic.php?p=3282#3282 . But it clearly simplifies the LIST algorithm, and I figured it was probably an unusual, but decent tradeoff for where he was at with everything.

In any case, I started to become curious about how he'd done the parsing, especially looking at a comment from folklore.org where he mentions a few more tidbits:

http://www.folklore.org/StoryView.py?pr ... comments=1

The comment is interesting if only as a testament to his self-taught genius. He briefly tossed out a couple of lines about the parsing piece:

Quote:
The 6502 was the 'latest' microprocessor at this point in time (late 1975). I had a hunch that nobody had written a BASIC for it and I had a chance to be the first. I started by learning BASIC a little - studying HP BASIC from a manual in our [calculator division] lab. I decided to put a syntax chart right into memory and use it to scan input by the user.


Hm, that got me wondering what exactly that meant. He's also mentioned elsewhere http://www.woz.org/letters/remember-zx80 that this approach led to a very modular way to extend the syntax, and a quote from there:

Quote:
I could go on. The BASIC turned out extremely modular, so I could easily add something by adding some syntax descriptions in near-text form, and write routines for the new functions or ops that were needed. The language didn't have to be rewritten.


By now I was really curious now about the parser. A lot of people have spoken about his hardware wizardry, but as a programmer, I wanted to see how he approached software; and thus began my saga -- thank you for helping me understand this corner of the Apple ][ software a little better.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 12 posts ] 

All times are UTC


Who is online

Users browsing this forum: No registered users and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: