Crenshaw - Let's Build a Compiler

drogon · Post by **drogon** » Wed Oct 16, 2024 10:28 am

barnacle wrote:

I tend to agree. As mentioned earlier, my code ignores spaces between the line number and the first keyword, and indents automatically on list. But it maintains spaces thereafter. It doesn't even know what a tab is at this point.

10 for q = 1 to 10
12345 for q = 1 to 10

both take 16 characters (2 bytes for line number, one byte line length, one byte each for 'for' and 'to', the text, and a terminating 0x0a). While

10forq=1to10
12345forq=1to10

uses five characters fewer - the spaces, ignoring the first.
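As a cross-check on those counts, here is a rough Python sketch of the arithmetic. The layout (2-byte line number, 1-byte length, one byte per keyword token, literal text, 0x0a terminator) is taken from the description above; the function name `stored_size` and the two-entry keyword list are only illustrative, and the keyword matching is deliberately naive.

```python
import re

KEYWORDS = ("for", "to")  # assumed: only the keywords needed for this example

def stored_size(line: str) -> int:
    """Bytes needed to store one program line under the layout described above."""
    # Split off the line number; spaces between it and the first keyword are dropped.
    _number, body = re.match(r"\s*(\d+)\s*(.*)", line).groups()
    size = 2 + 1                      # 2-byte line number + 1-byte line length
    i = 0
    while i < len(body):
        for kw in KEYWORDS:
            if body.startswith(kw, i):
                size += 1             # a keyword collapses to a single token byte
                i += len(kw)
                break
        else:
            size += 1                 # anything else, spaces included, is stored as-is
            i += 1
    return size + 1                   # terminating 0x0a

print(stored_size("10 for q = 1 to 10"))     # 16
print(stored_size("12345 for q = 1 to 10"))  # 16
print(stored_size("10forq=1to10"))           # 11
```

The five-byte difference is exactly the five spaces that follow the 'for' token.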

Neil

Traditionally (I guess) a Tiny Basic just stores text after the line number. The first ones AIUI didn't even store a line length but terminated the lines with a CR. Mine has a length byte, but contrary to the usual 6502 convention stores the line number high byte first. The reason for this is that the top bit is used to denote the end of program text - this limits line numbers to 0-32767 which is no big deal for a TB I reckon, and makes line insertion/deletion and LIST a little easier to manage (I think).

But all those spaces add up in terms of program size (which is an issue when you just have 4K of RAM with 512 of those precious bytes used for ZP and stack) and speed - each space takes measurable microseconds to skip over... It's always a compromise!
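A minimal Python sketch of how a length byte and a top-bit end marker can work together when searching for a line. The exact record layout here (line number high byte, low byte, a length byte that counts the whole record, then the text) and the names `pack_line` and `find_line` are assumptions for illustration, not the actual format of any of the interpreters discussed.

```python
END_MARK = 0x80  # assumed: a "high byte" with bit 7 set marks the end of program text

def pack_line(number: int, text: str) -> bytes:
    """Build one record: [number hi][number lo][record length][text...]."""
    if not 0 <= number <= 32767:
        raise ValueError("line number must fit in 15 bits")
    body = text.encode("ascii")
    length = 3 + len(body)            # assumed: the length byte covers the whole record
    if length > 255:
        raise ValueError("line too long for a one-byte length")
    return bytes([number >> 8, number & 0xFF, length]) + body

def find_line(program: bytes, target: int):
    """Return the text of line `target`, or None, hopping from record to record."""
    pos = 0
    # With the high byte stored first, one test on one byte detects the end marker.
    while not program[pos] & 0x80:
        number = (program[pos] << 8) | program[pos + 1]
        length = program[pos + 2]
        if number == target:
            return program[pos + 3 : pos + length].decode("ascii")
        pos += length                 # skip the line without scanning its text
    return None

program = pack_line(10, "LET A=1") + pack_line(20, "PRINT A") + bytes([END_MARK])
print(find_line(program, 20))  # PRINT A
print(find_line(program, 30))  # None
```

Without the length byte, the search would have to scan every character of every line looking for the terminator.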

-Gordon

barnacle · Post by **barnacle** » Wed Oct 16, 2024 11:34 am

Yep, the single length byte to speed up moving through the source... I have a find-this-line routine which is much faster as a result.

Remember 'renumber' and 'compress' programs, that moved all the lines and ate all the spaces... and created completely unreadable source?

Neil

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Oct 16, 2024 4:20 pm

BigEd wrote:

My personal preference is never to insert or retain tabs in source code. Tab as a key on the keyboard, which the editor can use to insert an appropriate amount of space, is a good way to go, in my opinion.

The UltraEdit editor, which I heavily use, may be configured so striking [Tab] will insert an arbitrary number of spaces in place of a single tab. Unfortunately, the editor in the Kowalski package doesn’t have that feature, although it can be configured to render a tab as a configurable number of blanks on the screen; a tab is still what gets stored in the file, however. Annoyingly, as Garth implies, tab rendering in programs is arbitrary, which can produce a display that has little resemblance to what was originally intended. Web browsers, in particular, seem to have no standard way to render a tab.

However, real tabs have their place in a data stream. For example, I wrote a program that can read the address book used by the Mozilla Thunderbird E-mail client and generate a list consisting of E-mail addresses and matching names (i.e., the display name field in each address book record), with the output sorted by name—the resulting list can be read and parsed by external programs. Since the name field will likely have at least one blank, a blank obviously cannot be considered a field separator. So I use <HT> (horizontal tab), which, conveniently, is easy for BASH and PHP to parse for word-splitting purposes.
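To illustrate why <HT> works where a blank does not, here is a small Python sketch (the program described above uses BASH and PHP, so this is only an analogy). The field order (name, then address), the function name `parse_records`, and the sample records are made up for the example.

```python
def parse_records(text: str) -> list[tuple[str, str]]:
    """Split tab-separated 'name<HT>address' lines into (name, address) pairs."""
    records = []
    for line in text.splitlines():
        if not line:
            continue                           # ignore empty lines
        name, address = line.split("\t", 1)    # the tab is the only field separator
        records.append((name, address))
    return records

# Hypothetical sample data; the names contain blanks, which is the whole point.
sample = "Ada Lovelace\tada@example.org\nCharles Babbage\tcharles@example.org\n"

print(parse_records(sample))
# Splitting on any white space instead breaks the name into two fields:
print("Ada Lovelace\tada@example.org".split())
```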

As Garth notes, one of the reasons for the use of <HT> as a field separator (aside from ease of parsing) is the desire to make files smaller and faster-loading. My professional computing experience began during a time when file-size conservation was front-and-center in any program’s design philosophy—something I unconsciously continue to perpetuate in my code.

Also, since all ASCII values below 32 are control values, it’s a snap to distinguish in-band control information from actual data. Taking advantage of the control range means quite a bit of metadata can be embedded into a data stream without consuming a lot of precious space—the programmer is free to interpret those 31 control codes as he or she sees fit.¹

It seems much of the ASCII control range is neglected in contemporary programs. I guess the current thinking now is that we’ve got gigs of RAM, terabytes of disk space, and MPUs running a thousand times faster than what we had when I started out doing this stuff. So who needs to worry about file sizes?

————————————————————
¹I don’t consider <NUL> to be a true control character—in string data, I only use it as a terminator.

barnacle · Post by **barnacle** » Wed Oct 16, 2024 6:34 pm

BigDumbDinosaur wrote:

————————————————————
¹I don’t consider <NUL> to be a true control character—in string data, I only use it as a terminator.

I do the same. After all, if it was useful as a character, it would surely be a different colour!

Neil

barnacle · Post by **barnacle** » Wed Oct 16, 2024 7:26 pm

It's moving along... it can do if and goto now:

Neolithic Tiny Basic 65c02
> 10 let a = 1
> 20 print "Tiny Basic "; a
> 30 let a = a + 1
> 40 if a <= 10
> 50 goto 20
> 60 endif
> 70 end
> run
Tiny Basic 1 
Tiny Basic 2 
Tiny Basic 3 
Tiny Basic 4 
Tiny Basic 5 
Tiny Basic 6 
Tiny Basic 7 
Tiny Basic 8 
Tiny Basic 9 
Tiny Basic 10

Neil

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Oct 16, 2024 8:30 pm

barnacle wrote:

BigDumbDinosaur wrote:

————————————————————
¹I don’t consider <NUL> to be a true control character—in string data, I only use it as a terminator.

I do the same. After all, if it was useful as a character, it would surely be a different colour!

Well, the name of the zero byte is “NUL,” not “DUL.”

In digital electronics, a low lights up a logic probe’s green LED, so I’d guess $00 is green as well.

Yuri · Post by **Yuri** » Wed Oct 16, 2024 10:23 pm

Kinda hijacked the conversation here....

BigDumbDinosaur wrote:

BigEd wrote:

My personal preference is never to insert or retain tabs in source code. Tab as a key on the keyboard, which the editor can use to insert an appropriate amount of space, is a good way to go, in my opinion.

The UltraEdit editor, which I heavily use, may be configured so striking [Tab] will insert an arbitrary number of spaces in place of a single tab. Unfortunately, the editor in the Kowalski package doesn’t have that feature, although it can be configured to render a tab as a configurable number of blanks on the screen; a tab is still what gets stored in the file, however. Annoyingly, as Garth implies, tab rendering in programs is arbitrary, which can produce a display that has little resemblance to what was originally intended.

Outside of word processing programs and the like, this has usually been why I tend to avoid using them. In most word processing you can explicitly set where in the document tabs lie and how they behave; and those details get stored along with the document.

To be honest, I don't think I ever generally worried too much about a few spaces vs. a tab even when I was working on my 386 and storing stuff on floppy. Considering that files are all going to take a multiple of 512-byte blocks, I don't think I ever noticed the difference in disk usage.

Quote:

Web browsers, in particular, seem to have no standard way to render a tab.

As I recall, white space in HTML is all usually treated like a single space character used for token/word separation. I'd have to track down the specs for it to be certain though.

&nbsp; and other entities were intended to encode special formatting characters like tab when they are needed. The rules also change depending on the element your text is in (e.g. <pre> or <code>).

I want to say those rules are defined clearly in the HTML 4.01, 5.0 and XHTML 1.0 DTDs, but I can't swear to that; I haven't looked at them in years. Really haven't needed them for any of the modern web stuff I work on day to day to be honest; formatting is largely handled with CSS these days.

Quote:

However, real tabs have their place in a data stream. For example, I wrote a program that can read the address book used by the Mozilla Thunderbird E-mail client and generate a list consisting of E-mail addresses and matching names (i.e., the display name field in each address book record), with the output sorted by name—the resulting list can be read and parsed by external programs. Since the name field will likely have at least one blank, a blank obviously cannot be considered a field separator. So I use <HT> (horizontal tab), which, conveniently, is easy for BASH and PHP to parse for word-splitting purposes.

To this day most spreadsheets will still happily import a CSV, tab-separated or other delimited text file just fine.

Quote:

As Garth notes, one of the reasons for the use of <HT> as a field separator (aside from ease of parsing) is the desire to make files smaller and faster-loading. My professional computing experience began during a time when file-size conservation was front-and-center in any program’s design philosophy—something I unconsciously continue to perpetuate in my code.

Also, since all ASCII values below 32 are control values, it’s a snap to distinguish in-band control information from actual data.

All well and good until you run out of control codes to mean stuff; there are only 32 (31) of them after all. How would I, for example, encode details about the selected font for a block of text in a word processor when I could have any list of fonts installed on my computer, independent of the list on yours?

At that point you have to add additional metadata about what needs to be loaded to render that correctly. Sure, you could add a control code that says "switch to next listed font", but at that point it isn't much different from using <font name="foo" />, which can then be edited by hand. Yes, a set of control codes could reduce the size of that file, and if space is at a premium that would matter, but if readability is what you need, then it is a detriment.

Quote:

Taking advantage of the control range means quite a bit of metadata can be embedded into a data stream without consuming a lot of precious space—the programmer is free to interpret those 31 control codes as he or she sees fit.¹

I can see upsides and downsides to that. If you're working on a singular program that doesn't need to worry about interoperability too much that'd be fine; and certainly something I've done myself.

But when it comes to trying to make a format that can at least be parsed by many different systems, that idea starts to quickly break down. Like it or not, plain text formats have been a thing for a long time and continue to be a thing if only because they are just easy to work with.

Quote:

It seems much of the ASCII control range is neglected in contemporary programs. I guess the current thinking now is that we’ve got gigs of RAM, terabytes of disk space, and MPUs running a thousand times faster than what we had when I started out doing this stuff. So who needs to worry about file sizes?

IDK if neglected is really the term I'd use. Forgotten is probably a more likely answer. I would imagine most people know what CR and LF do in modern systems, but I'd wager they have no clue where the names "carriage return" and "line feed" come from.

Kinda like this little gem:

*face palm*

Quote:

¹I don’t consider <NUL> to be a true control character—in string data, I only use it as a terminator.

I should point out that it also all depends on the character encoding. ASCII is one such encoding; I tend to do a lot of work with GSM 03.38, which encodes character 0 as the at sign (among other oddities).

GSM 03.38

BigDumbDinosaur · Post by **BigDumbDinosaur** » Thu Oct 17, 2024 1:31 am

Yuri wrote:

To be honest, I don't think I ever generally worried too much about a few spaces vs. a tab even when I was working on my 386 and storing stuff on floppy. Considering that files are all going to take a multiple of 512-byte blocks, I don't think I ever noticed the difference in disk usage.

I started working with computers before microprocessors even existed.

The first system I professionally programmed had 8K of RAM...that’s 8 kiloBITs, not bytes.

You can be sure we were “encouraged” to economize so code and data would fit into available core.

Quote:

Web browsers, in particular, seem to have no standard way to render a tab.

As I recall, white space in HTML is all usually treated like a single space character used for token/word separation. I'd have to track down the specs for it to be certain though.

White space that consists of blanks (ASCII 32) is rendered as a single space. An extended blank (code 160, the no-break space, which strictly lies outside 7-bit ASCII) is treated by most browsers as a distinct character; a string of them will be rendered as an equivalent number of blanks.

Browser behavior when encountering a tab character varies. Some browsers seem to render a tab as a series of blanks. I’m not up on the latest HTML standard, so I have no clue what a browser is supposed to do when it encounters a tab or other control character in HTML.

Quote:

However, real tabs have their place in a data stream...

To this day most spreadsheets will still happily import a CSV, tab-separated or other delimited text file just fine.

Who’s talking about a spreadsheet? I’m referring to computing in general. BTW, I’ve never found a good use for a spreadsheet.

Quote:

...since all ASCII values below 32 are control values, it’s a snap to distinguish in-band control information from actual data.

All well and good until you run out of control codes to mean stuff...how would I, for example, encode details about the selected font for a block of text in a word processor when I could have any list of fonts installed on my computer, independent of the list on yours?

I can tell you haven’t worked enough with real computers!

Just kidding!

One of the most useful of the 31 control codes is <ESC>. If you look very carefully at how devices such as “dumb” terminals (which, starting with the WYSE 60, became quite smart) work, or how a modern page printer understands what the computer is telling it to do, you will see how it is possible to extend those 31 control codes almost to infinity by beginning your control sequence with <ESC>.

Quote:

Yes, a set of control codes could reduce the size of that file, and if space is at a premium that would matter, but if readability is what you need, then it is a detriment.

The internal format of, say, a word processor file is rarely of concern to the end user; few people are going to care how the word processor interprets page formatting commands. WordPerfect, the word processor with which I am most familiar, stores metadata in a file as binary, with the encoding preceded by <ESC>. At least in the sane part of the computing world, <ESC> is understood to be a non-printing control code that will be followed by an application-defined sequence of bytes (some readable, and some not) that tell the program to change fonts, indent a block, brew some coffee, etc.

I, as a user, am interested in my documents being properly formatted, but I don’t particularly care if the actual mumbo-jumbo that, say, selects italic Helvetica as the current font is human-readable—I'm not going to do a hex dump of the file. All I care about is what ends up on the printed page.

Quote:

Taking advantage of the control range means quite a bit of metadata can be embedded into a data stream without consuming a lot of precious space—the programmer is free to interpret those 31 control codes as he or she sees fit.¹

I can see upsides and downsides to that. If you're working on a singular program that doesn't need to worry about interoperability too much that'd be fine; and certainly something I've done myself.

But when it comes to trying to make a format that can at least be parsed by many different systems, that idea starts to quickly break down. Like it or not, plain text formats have been a thing for a long time and continue to be a thing if only because they are just easy to work with.

Yes, but how do you explain the near-universality of Hewlett-Packard’s printer control language (PCL), which is not a plain-text format? PCL commands always start with either <ESC> or certain other ASCII control codes, e.g., <FF> (ASCII 12) to dump the image buffer to the page and then eject it. A typical PCL command might be <ESC>&l0O, which selects portrait orientation. Don’t you think if H-P had thought human readability was important they would have instead implemented <orientation portrait> or similar, instead of some ESCape mumbo-jumbo? H-P did what they did because they were mostly concerned about throughput...the less overhead passed in the data stream, the better the throughput.
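The overhead argument can be put in numbers with a few lines of Python. The PCL command is the one quoted above; the readable alternative is the hypothetical <orientation portrait> form from the same paragraph, not anything a real printer accepts.

```python
ESC = b"\x1b"

pcl_portrait = ESC + b"&l0O"          # the PCL command quoted above
verbose = b"<orientation portrait>"   # hypothetical human-readable equivalent

print(len(pcl_portrait))  # 5
print(len(verbose))       # 22
```

On a slow serial or parallel link that is more than four times as many bytes for the same instruction.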

Clearly, the ability of disparate systems to parse a data stream couldn’t have been much of a problem with the widely-used ANSI/ECMA control sequences used with many displays, including the Linux console. A typical ANSI/ECMA sequence starts with that ubiquitous <ESC> and finishes with mumbo-jumbo that even someone like me who has been working with it for some forty years still can’t decipher on sight.
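Hard to read on sight, perhaps, but easy for a program to pick apart. The sketch below separates ECMA-48 style control sequences (ESC [ ... final byte) from ordinary text in a byte stream. It handles only that one family of sequences, and the function name `split_stream` and the sample stream are invented for the example.

```python
import re

# ECMA-48 control sequence: ESC [ , then parameter bytes 0x30-0x3F,
# then intermediate bytes 0x20-0x2F, then one final byte 0x40-0x7E.
CSI = re.compile(rb"\x1b\[([0-?]*)([ -/]*)([@-~])")

def split_stream(data: bytes):
    """Yield ('text', bytes) and ('csi', params, final) items in stream order."""
    pos = 0
    for m in CSI.finditer(data):
        if m.start() > pos:
            yield ("text", data[pos:m.start()])
        yield ("csi", m.group(1).decode("ascii"), m.group(3).decode("ascii"))
        pos = m.end()
    if pos < len(data):
        yield ("text", data[pos:])

# "1;31" with final byte "m" selects bold red on most ANSI displays; "0m" resets.
stream = b"\x1b[1;31mError\x1b[0m: disk full"
for item in split_stream(stream):
    print(item)
```

Everything after <ESC> [ stays within printable ASCII, so the data and the control information never get confused with each other.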

Quote:

¹I don’t consider <NUL> to be a true control character—in string data, I only use it as a terminator.

I should point out that it also all depends on the character encoding. ASCII is one such encoding; I tend to do a lot of work with GSM 03.38, which encodes character 0 as the at sign (among other oddities).

ASCII is the character-encoding standard in the computing world and has been so since the 1960s, IBM’s EBCDIC notwithstanding. GSM is primarily a telecommunications thing; it’s not something that an E-mail server would use to forward a message to another server.

Yuri · Post by **Yuri** » Thu Oct 17, 2024 4:16 am

Response to the character encoding is here. Think we've hijacked this poor topic enough as it is.

barnacle · Post by **barnacle** » Thu Oct 17, 2024 4:37 am

Well, given that it's already moved from building a compiler to building a basic interpreter, I shouldn't worry too much.

But I will note in passing that the text format on my finished system will be ascii; eventually, storage will likely be FAT16/3.8 on compact flash. So _storage_ density is immaterial (the smallest CF I have is 2GB) but in-place does what it can both to decrease the overall size and simplify and speed up the interpreter by deleting any spaces between the line number and the code (and replacing them with indents on list) and by tokenising all keywords to eight-bit characters, with the high bit set.
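Roughly what that looks like, as a Python sketch rather than the real code. The keyword list, its order, the token values starting at 0x80, and the names `tokenise` and `detokenise` are all assumptions for illustration; the matching is naive (it would also tokenise a keyword buried inside a variable name) and assumes ASCII input.

```python
KEYWORDS = ["let", "print", "if", "goto", "endif", "end", "for", "to"]  # assumed set
TOKEN = {kw: 0x80 + i for i, kw in enumerate(KEYWORDS)}  # high bit set = token

def match_keyword(code: str, i: int):
    # Longest match first, so "endif" is not read as "end" followed by "if".
    for kw in sorted(KEYWORDS, key=len, reverse=True):
        if code.startswith(kw, i):
            return kw
    return None

def tokenise(code: str) -> bytes:
    """Drop leading spaces and replace keywords outside strings with one byte."""
    code = code.lstrip(" ")
    out = bytearray()
    i, in_string = 0, False
    while i < len(code):
        kw = None if in_string else match_keyword(code, i)
        if kw:
            out.append(TOKEN[kw])
            i += len(kw)
        else:
            if code[i] == '"':
                in_string = not in_string    # keywords inside quotes stay as text
            out.append(ord(code[i]))
            i += 1
    return bytes(out)

def detokenise(stored: bytes, indent: int = 1) -> str:
    """Expand tokens for LIST, putting back a fixed indent after the line number."""
    text = "".join(KEYWORDS[b - 0x80] if b & 0x80 else chr(b) for b in stored)
    return " " * indent + text

stored = tokenise('   print "Tiny Basic "; a')
print(len(stored))         # 18
print(detokenise(stored))
```

Because plain ASCII never has the high bit set, the interpreter can tell a token from text with a single bit test.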

Neil

BigEd · Post by **BigEd** » Thu Oct 17, 2024 7:48 am

Yuri did a great thing there! Even better if Garth could whisk the latter part of this topic over to Yuri's.

drogon · Post by **drogon** » Thu Oct 17, 2024 11:06 am

barnacle wrote:

Well, given that it's already moved from building a compiler to building a basic interpreter, I shouldn't worry too much.

But I will note in passing that the text format on my finished system will be ascii; eventually, storage will likely be FAT16/3.8 on compact flash. So _storage_ density is immaterial (the smallest CF I have is 2GB) but in-place does what it can both to decrease the overall size and simplify and speed up the interpreter by deleting any spaces between the line number and the code (and replacing them with indents on list) and by tokenising all keywords to eight-bit characters, with the high bit set.

Neil

Good to hear the progress & ideas. I'm keen to follow along.

My own TinyBasic is derived from those of the early era - it runs in a sort of virtual machine with the IL (Intermediate Language) and assembler support routines. It could be vastly improved - but at what cost? Right now it's completely self contained in 4KB including IO and file load/save routines....

My "big" Basic (written in C) tokenises everything. Absolutely everything. A REM statement? Yes, that's a token, as is the text following (part of the same token) ... The // (alternative REM) is also a token, but a different one. Arithmetic signs are also tokens - even though they only require one byte in the source code, they end up as 4 bytes (a word - it's aimed at 32-bit systems) in the internal representation.

There is a compromise - probably the existing "classic" (?) MS style Basics or BBC Basic.

But I'd agree with ASCII - this is a retro project so use a retro, period-appropriate character set - and even today, UTF-8 is ASCII - at least for the first 128 code points... I've used an "8-bit ASCII" in my fonts (where I've defined my own fonts) based loosely on the Acorn font which is fine but even then has minor irritations with characters like # and £ and € ...
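A quick Python check of the UTF-8 point, using the characters mentioned above. The variable names are just for the example.

```python
# Every one of the 128 ASCII code points encodes to the same single byte in UTF-8.
ascii_unchanged = all(chr(i).encode("utf-8") == bytes([i]) for i in range(128))
print(ascii_unchanged)  # True

# Anything beyond that needs more than one byte.
for ch in "A#£€":
    print(ch, ch.encode("utf-8").hex(" "))
# A 41
# # 23
# £ c2 a3
# € e2 82 ac
```

So # is safe everywhere, while £ and € are where an 8-bit font has to pick its own code.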

I have an 8-bit version of my "Big" Basic on the back burner - sadly that's where it might stay, but who knows.

-Gordon

barnacle · Post by **barnacle** » Thu Oct 17, 2024 11:59 am

I am curious to see how much space it will take once I convert it to 65c02 assembly - which will be the final(ish) phase.

It's been interesting finally figuring out Dr Crenshaw's method, and it's remarkably easy to expand.

Neil

BigDumbDinosaur · Post by **BigDumbDinosaur** » Thu Oct 17, 2024 1:21 pm

BigEd wrote:

Yuri did a great thing there! Even better if Garth could whisk the latter part of this topic over to Yuri's.

Yeah, we did get into the weeds a bit.

Kinda my fault, too.

We now return to our regular programming. Now what was that?

BigDumbDinosaur · Post by **BigDumbDinosaur** » Thu Oct 17, 2024 1:35 pm

barnacle wrote:

I am curious to see how much space it will take once I convert it to 65c02 assembly - which will be the final(ish) phase.

Once you do that, I might get a burr under my saddle and see if I can spin a 65C816 rendition. While a C02 rendition would run on the 816 in native mode as is (assuming no Rockwell extensions and no zero-page hinkiness in the code), some of the 816-specific features might help to compact code and improve execution speed. I’ve been sorta looking for a language implementation to play around with on POC V1.3, and this might be it.

Quote:

It's been interesting finally figuring out Dr Crenshaw's method, and it's remarkably easy to expand.

For some reason, possibly related to my advancing age, slowly-dying synapses and long-term aversion to high-level languages running on 6502 hardware, I have found Dr. Crenshaw’s expositions somewhat slow-going. I’m quite fuzzy on how nested parentheses get evaluated, which is kind of key to a lot of language implementations...except, maybe, Forth.
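For what it's worth, the nested-parenthesis part comes down to three mutually recursive routines: an expression is terms joined by + or -, a term is factors joined by * or /, and a factor is either a number or a whole parenthesised expression. The sketch below is in Python and evaluates directly, whereas Crenshaw's routines emit assembly, so treat it as an illustration of the recursive-descent idea rather than his code; the routine names and the integer-only grammar are choices made for the example.

```python
def evaluate(source: str) -> int:
    """Evaluate +, -, *, / on non-negative integer literals with nested parentheses."""
    text = source.replace(" ", "")
    pos = 0

    def peek() -> str:
        return text[pos] if pos < len(text) else ""

    def take(expected=None) -> str:
        nonlocal pos
        ch = peek()
        if expected is not None and ch != expected:
            raise SyntaxError(f"expected {expected!r} at position {pos}")
        pos += 1
        return ch

    def factor() -> int:
        if peek() == "(":
            take("(")
            value = expression()      # recursion: a bracket holds a full expression
            take(")")
            return value
        digits = ""
        while peek().isdigit():
            digits += take()
        if not digits:
            raise SyntaxError(f"number expected at position {pos}")
        return int(digits)

    def term() -> int:
        value = factor()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= factor()
            else:
                value //= factor()    # integer division (floors)
        return value

    def expression() -> int:
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    result = expression()
    if pos != len(text):
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return result

print(evaluate("2*(3+4)"))          # 14
print(evaluate("(1+2)*(3+(4-1))"))  # 18
```

Each open bracket simply starts a fresh call to expression(), so the nesting depth is carried by the call stack and no explicit bookkeeping is needed.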
