Character Encodings (WAS: Crenshaw - Let's Build a Compiler)

Yuri · Post by **Yuri** » Thu Oct 17, 2024 4:15 am

A continuation from: Crenshaw - Let's Build a Compiler

BigDumbDinosaur wrote:

Yuri wrote:

To be honest, I don't think I ever worried too much about a few spaces vs. a tab, even when I was working on my 386 and storing stuff on floppy. Since every file takes up a multiple of 512-byte blocks anyway, I don't think I ever noticed the difference in disk usage.

I started working with computers before microprocessors even existed.

The first system I professionally programmed had 8K of RAM...that’s 8 kiloBITs, not bytes.

You can be sure we were “encouraged” to economize so code and data would fit into available core.

Fair enough, the first computer I worked with was a Mac 128K, so I had considerably more memory than that starting out. *shrug*

Hardly terabytes and gigabytes of space though.

Quote:

White space that consists of blanks (ASCII 32) is rendered as a single space. An extended blank (ASCII 160) is treated by most browsers as a distinct character—a string of them will be rendered as an equivalent number of blanks like this.

Character 160 is not part of 7-bit ASCII; it comes from the 8-bit extensions (ISO 8859-1, for example), where it is the "non-breaking space". Under UTF-8 that character is encoded as two bytes, 0xC2 0xA0. UTF-16 encodes it as 0x00A0 (160).
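A quick way to check those byte values, as a minimal Python sketch (the variable names are only for illustration):

```python
# U+00A0 is NO-BREAK SPACE (code point 160).
nbsp = "\u00a0"

utf8 = nbsp.encode("utf-8")        # two bytes in UTF-8
utf16 = nbsp.encode("utf-16-be")   # one 16-bit unit, big-endian, no BOM
latin1 = nbsp.encode("latin-1")    # a single byte in ISO 8859-1

print(utf8.hex(), utf16.hex(), latin1.hex())   # prints: c2a0 00a0 a0
```

Trying `nbsp.encode("ascii")` raises UnicodeEncodeError, since 160 is outside the 7-bit range.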

The HTML spec accounts for non-breaking spaces though. Most lay people wouldn't know how to type a literal character 160 on their keyboard, hence the use of &nbsp; and other HTML entities.

Quote:

Browser behavior when encountering a tab character varies. Some browsers seem to render a tab as a series of blanks. I’m not up on the latest HTML standard, so I have no clue what a browser is supposed to do when it encounters a tab or other control character in HTML.

I think it is supposed to ignore it and pass it on to whatever handles actually rendering text to deal with. (Basically leave it up to the character encoding on the OS to figure out what to do with it)

Quote:

Who’s talking about a spreadsheet? I’m referring to computing in general. BTW, I’ve never found a good use for a spreadsheet.

I've used them for all sorts of things. Mileage may vary of course depending on what you're used to.

In any event, I was pointing out that the practice of using a TAB (or other such single character) as a data delimiter is not dead and very much still alive and in use.
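As a small illustration of TAB-delimited data still being practical, here is a sketch using Python's standard csv module. The sample records are made up for the example; note that the fields can contain ordinary spaces without any quoting.

```python
import csv
import io

# Hypothetical sample data: fields separated by a single TAB (ASCII 9).
raw = "name\tcpu\tyear\nApple II\t6502\t1977\nMac 128K\t68000\t1984\n"

# csv.reader splits on whatever single character is given as the delimiter.
rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
for row in rows:
    print(row)
```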

Quote:

I can tell you haven’t worked enough with real computers!

Just kidding!

IDK, think I'm going on 40 years now; parents couldn't pry me off the computer when I was a kid, as much as they sometimes tried.

Quote:

One of the most useful of the 31 control codes is <ESC>. If you look very carefully at how devices such as “dumb” terminals (which, starting with the WYSE 60, became quite smart) work, or how a modern page printer understands what the computer is telling it to do, you will see how it is possible to extend those 31 control codes almost to infinity by beginning your control sequence with <ESC>.

Not sure I see much difference really between <ESC>(insert sequence of chars to interpret); and <(insert sequence of chars to interpret)>. In the end they are just characters and when it comes down to it, what the computer sees are just numbers; it's the meaning that is assigned to them that makes it significant. The only real significant difference in my mind is that almost all software will try and intercept the press of the escape key and do something special with it. (It is a control character/key after all.)

Quote:

I, as a user, am interested in my documents being properly formatted, but I don’t particularly care if the actual mumbo-jumbo that, say, selects italic Helvetica as the current font is human-readable—I'm not going to do a hex dump of the file. All I care about is what ends up on the printed page.

But if you were making a document to be rendered on multiple different computers without the need to write a specialized editor application, you probably would be. (Hence HTML for example)

Quote:

Yes, but how do you explain the near-universality of Hewlett-Packard’s printer control language (PCL), which is not a plain-text format? PCL commands always start with either <ESC> or certain other ASCII control codes, e.g., <FF> (ASCII 12) to dump the image buffer to the page and then eject it. A typical PCL command might be <ESC>&l0O, which selects portrait orientation. Don’t you think if H-P had thought human readability was important they would have instead implemented <orientation portrait> or similar, instead of some ESCape mumbo-jumbo? H-P did what they did because they were mostly concerned about throughput...the less overhead passed in the data stream, the better the throughput.

I'm not overly familiar with PCL or how it came to be. That being said, as far as I know the intention was to allow a driver to communicate with a printer.

Wikipedia states it "became a de facto industry standard," suggesting to me that it wasn't really HP's intention to make any kind of standard at all. And I'd guess that what happened is others reverse engineered it to make a "compatible" product.

(Pure speculation on my part though)
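On the throughput point in the quote, a rough byte count of the two forms is easy to do. This is only a sketch: the PCL command is the one given in the quote, while the readable form is the hypothetical alternative from the quote and not a real printer command.

```python
ESC = b"\x1b"  # ASCII 27

pcl_portrait = ESC + b"&l0O"                  # PCL command as given in the quote
verbose_portrait = b"<orientation portrait>"  # hypothetical readable form

print(len(pcl_portrait), len(verbose_portrait))   # prints: 5 22
```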

Form feed got a lot of its meaning from the teletype machines, much like CR and LF did. I don't doubt some of the other characters had other specialized purposes back in the day, but heck if I know what "Device Control One" (17), "Device Control Two" (18), etc. are supposed to mean/do.

That being said, there are formats that do have both text and binary versions. PDF and FBX for example come to mind.

Other plain text formats that come to mind, besides HTML, would be TeX, CSV, innumerable Un*x config files (e.g. /etc/fstab), and a good number of internet protocols (SMTP, HTTP, IRC, to name a few).

Quote:

Clearly, the ability of disparate systems to parse a data stream couldn’t have been much of a problem with the widely-used ANSI/ECMA control sequences used with many displays, including the Linux console. A typical ANSI/ECMA sequence starts with that ubiquitous <ESC> and finishes with mumbo-jumbo that even someone like me who has been working with it for some forty years still can’t decipher on sight.

The one I often shake my fist at because there isn't a clean way to determine when a sequence starts and ends if you don't know what those random letters are? The one that, as I recall, started out simply enough and got bastardized over the decades to become the monster it is now?

<ESC>[(ignore until); only works part of the time, which led Linux and other Un*x-like variants to use the (in)famous termcap file. That is just a massive database that, as I vaguely recall someone else here putting it, "must be maintained by someone who is mentally unstable." (not a direct quote)

Heck, even just trying to support the VT-100 codes for my own 65xxx software is a bit of a nightmare because different terminals all want to send the codes in their own "special" way..... >_>

"You said you wanted us to send VT-100 codes, but we decided that for these 4 keys, we'd use the XTerm escapes instead....." <screams internally>

Don't get me started on what my poor friend has to deal with when working on their BBS software..... (I never hear the end of it when a bug comes in about another BBS or terminal that doesn't work because it doesn't follow the "spec" *eyeroll*)

If I'm not mistaken, isn't the termcap file itself plain text? (Or starts out as such and then gets compiled or something like that?)

Quote:

ASCII is the character-encoding standard in the computing world and has been so since the 1960s, IBM’s EBCDIC notwithstanding. GSM is primarily a telecommunications thing—it’s not something that an E-mail server would use to forward a message to another server.

Yet, perhaps somewhat ironically, that is almost exactly what the software I work on in my day job does.

jgharston · Post by **jgharston** » Wed Oct 23, 2024 10:55 pm

Yuri wrote:

The one I often shake my fist at because there isn't a clean way to determine when a sequence starts and ends if you don't know what those random letters are? The one that, as I recall, started out simply enough and got bastardized over the decades to become the monster it is now?

<ESC>[(ignore until); only works part of the time, which led Linux and other Un*x-like variants to use the (in)famous termcap file. That is just a massive database that, as I vaguely recall someone else here putting it, "must be maintained by someone who is mentally unstable." (not a direct quote)

ANSI <esc> sequences are defined as being <esc> ([) (any characters < &40) (terminating character >&3F). So a sequence always ends with @ABCDE....z{|}~ so there /is/ a clean way to scan past a sequence, you just keep stepping past until you get to a %01xxxxxx byte.
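The scan described above can be sketched in a few lines of Python. The function name `skip_csi` is made up for this example, and it only covers the <esc>[ form described here; the final-byte range is taken as &40 to &7E, matching the "@ through ~" list.

```python
def skip_csi(data: bytes, pos: int) -> int:
    """Return the index just past the <esc>[ sequence starting at data[pos].

    Returns len(data) if the sequence is cut off before its final byte.
    """
    if data[pos:pos + 2] != b"\x1b[":
        raise ValueError("not an <esc>[ sequence")
    i = pos + 2
    while i < len(data):
        if 0x40 <= data[i] <= 0x7E:   # terminating character: @ A-Z ... a-z { | } ~
            return i + 1
        i += 1                        # anything below &40 is a parameter: step past it
    return len(data)


stream = b"\x1b[1;31mred\x1b[0m"
end = skip_csi(stream, 0)
print(end, stream[end:end + 3])   # prints: 7 b'red'
```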

Quote:

Heck, even just trying to support the VT-100 codes for my own 65xxx software is a bit of a nightmare because different terminals all want to send the codes in their own "special" way..... >_>
"You said you wanted us to send VT-100 codes, but we decided that for these 4 keys, we'd use the XTerm escapes instead....." <screams internally>

Gawd, while *outgoing* ANSI/esc sequences are fairly well documented, it's taken me years of research and testing to document *incoming* sequences: link. Again, they are defined to be <esc> ([) (any <&40) (terminating >&3F) but finding *what* sequences are emitted by *what* keypresses is like fighting through a brick wall. Almost all information says "write this code to see what your keyboard gives". I DON'T CARE WHAT *MY* KEYBOARD GIVES, I *KNOW* what *MY* keyboard gives, I need to know what *OTHER* *PEOPLE'S* keyboards give.

Or, a related rant: I'm *WRITING* a keyboard, so what should *I* be emitting?

Yuri · Post by **Yuri** » Thu Oct 24, 2024 3:25 am

jgharston wrote:

ANSI <esc> sequences are defined as being <esc> ([) (any characters < &40) (terminating character >&3F). So a sequence always ends with @ABCDE....z{|}~ so there /is/ a clean way to scan past a sequence, you just keep stepping past until you get to a %01xxxxxx byte.

Unless you consider these:
* Save cursor position: <esc>7
* Restore cursor position: <esc>8
* Set video mode <esc>[=<num>

But wait there's more! Let's look at all the codes that XTerm defines....
* <esc>#3
* <esc>#4
* <esc>#5
* <esc>#6
* <esc>#8
* <esc>%@
....

I hate 'em! >_<
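For what it's worth, the sequences listed above can still be stepped over if the scanner also knows the general ECMA-48 escape shape: <esc>, then any number of "intermediate" bytes (&20 to &2F), then one final byte. Here is a rough Python sketch. The name `skip_escape` is invented for the example, it ignores string-type sequences such as OSC and DCS entirely, and the last line of the usage assumes the video mode sequence ends in "h" as it does in ANSI.SYS.

```python
def skip_escape(data: bytes, pos: int) -> int:
    """Return the index just past the escape sequence starting at data[pos]."""
    if data[pos] != 0x1B:
        raise ValueError("not an escape sequence")
    i = pos + 1
    if i < len(data) and data[i] == 0x5B:
        # <esc>[ : step past parameter bytes until a final byte (&40..&7E)
        i += 1
        while i < len(data) and not 0x40 <= data[i] <= 0x7E:
            i += 1
    else:
        # plain escape: step past intermediates (&20..&2F), then one final byte
        while i < len(data) and 0x20 <= data[i] <= 0x2F:
            i += 1
    return min(i + 1, len(data))   # clamp if the sequence is cut off


print(skip_escape(b"\x1b7", 0))      # <esc>7   -> 2
print(skip_escape(b"\x1b#3", 0))     # <esc>#3  -> 3
print(skip_escape(b"\x1b%@", 0))     # <esc>%@  -> 3
print(skip_escape(b"\x1b[=3h", 0))   # <esc>[=3h (assumed ANSI.SYS form) -> 5
```

This only finds where a sequence ends; it says nothing about what the sequence means, which is where the real pain is.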

Yuri · Post by **Yuri** » Thu Oct 24, 2024 7:02 am

For anyone who is interested, a friend of mine (same one that likes to work on BBS software) sent this to me:
https://vt100.net/emu/dec_ansi_parser

It has a state machine flow of how the VT 500 hardware terminals do their escape parsing.

BigEd · Post by **BigEd** » Thu Oct 24, 2024 9:01 am

That's nice! There's a good site here: https://vt100.net/dec/vt100/rom/bugs

BigDumbDinosaur · Post by **BigDumbDinosaur** » Thu Oct 24, 2024 3:54 pm

The VT/ANSI/ECMA48 mess is just that: a big, friggin’, designed-by-committee, implemented-by-a-drunken-monkey mess. When WYSE developed their terminals, starting with the WY50 (and most famously, the WY60), they studiously avoided all that CSI... mumbo-jumbo and adopted a pseudo-binary control scheme that is far easier to implement and much easier to work with on the server side. It’s a shame the industry stubbornly stuck with the VT/ANSI/ECMA48 mess, because there are much simpler ways of accomplishing the same thing.

jgharston · Post by **jgharston** » Fri Oct 25, 2024 6:58 pm

Yuri wrote:

jgharston wrote:

ANSI <esc> sequences are defined as being <esc> ([) (any characters < &40) (terminating character >&3F). So a sequence always ends with @ABCDE....z{|}~ so there /is/ a clean way to scan past a sequence, you just keep stepping past until you get to a %01xxxxxx byte.

Unless you consider these:
* Save cursor position: <esc>7
* Restore cursor position: <esc>8
* Set video mode <esc>[=<num>

But wait there's more! Let's look at all the codes that XTerm defines....
* <esc>#3
* <esc>#4
* <esc>#5
* <esc>#6
* <esc>#8
* <esc>%@
....

I hate 'em! >_<

Ahbut, they're not ANSI.

But yes, horrible mess.

Yuri · Post by **Yuri** » Fri Oct 25, 2024 7:31 pm

jgharston wrote:

Ahbut, they're not ANSI.

But yes, horrible mess.

It's enough to make a person want to develop their own standard!

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sat Oct 26, 2024 4:13 am

Yuri wrote:

jgharston wrote:

Ahbut, they're not ANSI.

But yes, horrible mess.

It's enough to make a person want to develop their own standard!

That cartoon reminds me of the classic song “99 Bottles of Beer On the Wall,” but in reverse.
