Historical question about ASCII

Let's talk about anything related to the 6502 microprocessor.
Dan Moos
Posts: 277
Joined: 11 Mar 2017
Location: Lynden, WA

Historical question about ASCII

Post by Dan Moos »

I've always wondered this.

Is there a reason the ASCII codes 0-9 don't map to the corresponding digit characters? As in, ASCII 0 is "0", 1 is "1", etc.

It's just an annoyance to deal with when converting strings to numbers today, but on the computers of yesterday, wouldn't the extra step of subtracting "0" from each character to get the matching number have been a bigger deal computation-wise?

The only problem I see with having "0" be 0 is that you couldn't have a NULL character. Is that reason enough? As I write this, that small thing does seem like a good reason, I guess.

Anyone know how it came to be as it is?
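The conversion step in question can be sketched in a few lines of Python (a modern illustration, not period code): since the digits '0'-'9' sit at $30-$39, turning a digit character into its value means subtracting $30.

```python
# ASCII places '0'..'9' at $30..$39, so a digit's numeric value is
# its code minus $30 (the code for '0').
def ascii_digit_to_value(ch):
    code = ord(ch)
    if 0x30 <= code <= 0x39:          # '0'..'9'
        return code - 0x30            # strip the $30 offset
    raise ValueError(f"not an ASCII digit: {ch!r}")

print(hex(ord("0")))                  # 0x30
print(ascii_digit_to_value("7"))      # 7
```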
User avatar
GARTHWILSON
Forum Moderator
Posts: 8773
Joined: 30 Aug 2002
Location: Southern California
Contact:

Re: Historical question about ASCII

Post by GARTHWILSON »

I sure wish the A came right after the 9, so you wouldn't have to test in the conversion of hex numbers, just subtract $30.
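The gap Garth describes can be shown with a quick Python sketch (illustrative only): because 'A' is $41 rather than $3A, a hex-digit conversion needs a range test and a second offset instead of a single subtract-$30.

```python
# '9' is $39 but 'A' is $41, so converting a hex digit takes a range
# test and two different offsets rather than one "subtract $30".
def hex_digit_to_value(ch):
    code = ord(ch)
    if 0x30 <= code <= 0x39:          # '0'..'9' -> 0..9
        return code - 0x30
    if 0x41 <= code <= 0x46:          # 'A'..'F' -> 10..15
        return code - 0x41 + 10       # i.e. subtract $37
    raise ValueError(f"not an uppercase hex digit: {ch!r}")

print(hex_digit_to_value("9"))        # 9
print(hex_digit_to_value("A"))        # 10
```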
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Historical question about ASCII

Post by Arlet »

GARTHWILSON wrote:
I sure wish the A came right after the 9, so you wouldn't have to test in the conversion of hex numbers, just subtract $30.
That wouldn't help me, because I generally prefer lower case :D
User avatar
BigDumbDinosaur
Posts: 9426
Joined: 28 May 2009
Location: Midwestern USA (JB Pritzker’s dystopia)
Contact:

Re: Historical question about ASCII

Post by BigDumbDinosaur »

Dan Moos wrote:
I've always wondered this.

Is there a reason the ASCII codes 0-9 don't map to the corresponding digit characters? As in, ASCII 0 is "0", 1 is "1", etc.

It's just an annoyance to deal with when converting strings to numbers today, but on the computers of yesterday, wouldn't the extra step of subtracting "0" from each character to get the matching number have been a bigger deal computation-wise?

The only problem I see with having "0" be 0 is that you couldn't have a NULL character. Is that reason enough? As I write this, that small thing does seem like a good reason, I guess.

Anyone know how it came to be as it is?
It's a long story and I suggest you do some searching, starting with Baudot code, the distant ancestor of ASCII that was intended for use with nineteenth-century telegraph systems. Also, read about teleprinters, which were an essential tool of news services such as the AP and Reuters for many years.

Despite the seeming incongruities in the ASCII character set, there is a method to the madness. Only those of us who have been around long enough to have worked with Teletype machines and Friden Flexowriters consider ASCII to be 100 percent logical. :D
x86?  We ain't got no x86.  We don't NEED no stinking x86!
User avatar
Rob Finch
Posts: 465
Joined: 29 Dec 2002
Location: Canada
Contact:

Re: Historical question about ASCII

Post by Rob Finch »

I was wondering at one point why the newer Unicode standard doesn't support things that appear in a keyboard stream, such as cursor controls. I got talking to a Unicode expert about it one day and they seemed to have a reasonable explanation. So I made up my own set of virtual keycodes for one project.
http://www.finitron.ca/Documents/virtual_keycodes.html
I'd still like to know: what is the standard for virtual keycodes?
ASCII is an older code which is great when characters fit into six or eight bits. But for any apps that need to be internationalized a wide code like Unicode is required.
User avatar
BigDumbDinosaur
Posts: 9426
Joined: 28 May 2009
Location: Midwestern USA (JB Pritzker’s dystopia)
Contact:

Re: Historical question about ASCII

Post by BigDumbDinosaur »

Rob Finch wrote:
ASCII is an older code which is great when characters fit into six or eight bits. But for any apps that need to be internationalized a wide code like Unicode is required.
Actually, there is only one form of ASCII, called "US-ASCII," and that is seven bits to the datum. ASCII does not define the meanings of data that are in the range $80 to $FF, inclusive. ASCII was strictly a product of the forerunner of the American National Standards Institute, hence the US-ASCII moniker. INCITS uses that reference to avoid confusion with informal extensions to the ASCII set, as well as ASCII-like enhancements, such as Unicode.

Unicode produces a bulkier data stream than ASCII and in situations in which the alphanumeric set plus punctuation and control codes is all that is needed, ASCII will be substantially more efficient and economical of bandwidth. For example, transmitting binary data in Intel or Motorola hex form can be done solely with seven bit ASCII, using only numerals, uppercase letters A-F and a few control codes (typically <CR>, <LF> and <EOT>). Western languages that use only the Latin alphabet are transmittable in ASCII and even when lacking some diacritical marks, are usually intelligible to native speakers. Unicode was primarily developed to handle Latin characters with diacritical marks, such as ü, å, etc., localized characters, such as Æ, as well as the characters found in non-Latin alphabets, e.g., Cyrillic.

Incidentally, one of the reasons the control codes in the ASCII set are $00-$1F is that the mechanism in Teletypes was arranged to recognize the low bit patterns as control functions, not printing characters. As ASCII evolved, this characteristic was accommodated so Teletypes could be used as computer I/O devices. The spread between "0-9", "A-Z" and "a-z" exists because two bits decide whether the character is a numeral, an uppercase letter or a lowercase letter. It all makes sense in the contexts of doing case conversion or determining whether a user typed a numeral or a letter of either case.
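That layout can be demonstrated in a couple of lines (Python used purely as a modern illustration): bit 5 ($20) alone separates the uppercase and lowercase columns, so case conversion is a single bit toggle.

```python
# 'A' is $41 and 'a' is $61: they differ only in bit 5 ($20),
# so toggling that one bit converts the case of any ASCII letter.
def toggle_case(ch):
    return chr(ord(ch) ^ 0x20)        # valid for letters only

print(toggle_case("A"))               # a
print(toggle_case("z"))               # Z
```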
x86?  We ain't got no x86.  We don't NEED no stinking x86!
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Historical question about ASCII

Post by BigEd »

Unicode, when encoded in UTF-8, and carrying ASCII, is no more bulky - unless you regard 8 bits as more than 7 bits, which you might!
User avatar
BigDumbDinosaur
Posts: 9426
Joined: 28 May 2009
Location: Midwestern USA (JB Pritzker’s dystopia)
Contact:

Re: Historical question about ASCII

Post by BigDumbDinosaur »

BigEd wrote:
Unicode, when encoded in UTF-8, and carrying ASCII, is no more bulky...
Correct on UTF-8, which parallels ASCII, but supports the $80-$FF range. I was referring to UTF-16.
Quote:
- unless you regard 8 bits as more than 7 bits, which you might!
Funny you mention that. We think in terms of eight bits to the byte, yet in serial communications, seven bits continue to be used in setups that use only ASCII, hand-held serial bar code scanners being one such case. As another example, I have here in my office a Welch Allyn ST6980 magnetic stripe reader (MSR, aka credit card reader) that is interfaced to a TIA-232 port on my Linux software development machine. The reader uses a seven-bit data format, which means, in theory, the UART has less work to perform to serialize and deserialize each datum exchanged with the MSR. :D However, I suspect that any performance gain in that regard is vanishingly small. :D
x86?  We ain't got no x86.  We don't NEED no stinking x86!
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Historical question about ASCII

Post by BigEd »

(Umm, UTF-8 expresses any Unicode character - as can UTF-16, but UTF-8 is more compact.)
User avatar
commodorejohn
Posts: 299
Joined: 21 Jan 2016
Location: Placerville, CA
Contact:

Re: Historical question about ASCII

Post by commodorejohn »

BigDumbDinosaur wrote:
However, I suspect that any performance gain in that regard will be vanishingly small. :D
Well, at the same baud rate, assuming a fairly standard-ish packet with one stop bit and one start bit per character, it's an 11% increase in throughput.
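Assuming one bit per symbol on a plain serial line, the arithmetic behind that figure can be sketched as:

```python
# Frame sizes: start bit + data bits + stop bit.
bits_7n1 = 1 + 7 + 1                  # 9 bits per character
bits_8n1 = 1 + 8 + 1                  # 10 bits per character

# At the same line rate, 7N1 moves more characters per second.
gain = bits_8n1 / bits_7n1 - 1
print(f"{gain:.1%}")                  # 11.1%
```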
User avatar
BigDumbDinosaur
Posts: 9426
Joined: 28 May 2009
Location: Midwestern USA (JB Pritzker’s dystopia)
Contact:

Re: Historical question about ASCII

Post by BigDumbDinosaur »

BigEd wrote:
(Umm, UTF-8 expresses any Unicode character - as can UTF-16, but UTF-8 is more compact.)
I don't believe UTF-8 supports many non-Latin character sets, such as traditional Chinese. In fact, I seem to recall that pairs of UTF-16 words may be used in such cases, resulting in 32 bits being passed per character.
commodorejohn wrote:
BigDumbDinosaur wrote:
However, I suspect that any performance gain in that regard will be vanishingly small. :D
Well, at the same baud rate, assuming a fairly standard-ish packet with one stop bit and one start bit per character, it's an 11% increase in throughput.
You are confusing the data rate in bits per second with the baud rate; the two are not directly related. Baud refers to the symbol rate on the medium, not the data transmission rate. In the case of telephone modems, the baud rate is often a fraction of the bit rate due to the encoding scheme being used. For instance, a typical analog telephone link that spans more than one central office cannot support frequencies much above three kilohertz. If the baud rate of a pair of modems using that link were the same as the practical bit rate, you'd have a 3 kbps link, neglecting the effects of errors. The 56K rate achieved with the V.90 standard was the result of using an advanced encoding scheme that allowed many bits to be exchanged within a single symbol, using 8000 baud downstream to the subscriber and 3429 baud upstream from the subscriber.

On a hardwired TIA-232 link, baud rate and bit rate are the same. If a pair of short-haul modems is introduced, as was a once-common arrangement in large factories and adjacent buildings sharing a common host, the baud rate between the modems will usually be lower than the bit rate on the TIA-232 connections to the modems, since the bandwidth limits of analog telephone lines still apply.
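Using the figures quoted above, the downstream arithmetic is simple (a sketch only; the real V.90 coding details are more involved):

```python
# Bit rate = symbol rate (baud) x bits carried per symbol.
downstream_baud = 8000                # PCM symbols per second to the subscriber
bits_per_symbol = 7                   # effective payload per symbol at 56K
print(downstream_baud * bits_per_symbol)   # 56000 bps
```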

Incidentally, a format of seven data bits and two stop bits is possible with most serial devices, as is seven data bits, one stop bit and parity, the latter of which is often used with MSRs and bar code scanners.
x86?  We ain't got no x86.  We don't NEED no stinking x86!
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Historical question about ASCII

Post by BigEd »

BigDumbDinosaur wrote:
BigEd wrote:
(Umm, UTF-8 expresses any Unicode character - as can UTF-16, but UTF-8 is more compact.)
I don't believe UTF-8 supports many non-Latin character sets, such as traditional Chinese. In fact, I seem to recall that pairs of UTF-16 words may be used in such cases, resulting in 32 bits being passed per character.
UTF-8 is really rather clever and interesting - I think perhaps you don't know what it is. Well worth looking into!
rwiker
Posts: 294
Joined: 03 Mar 2011

Re: Historical question about ASCII

Post by rwiker »

BigDumbDinosaur wrote:
BigEd wrote:
(Umm, UTF-8 expresses any Unicode character - as can UTF-16, but UTF-8 is more compact.)
I don't believe UTF-8 supports many non-Latin character sets, such as traditional Chinese. In fact, I seem to recall that pairs of UTF-16 words may be used in such cases, resulting in 32 bits being passed per character.
This is incorrect. UTF-8 and UTF-16 are different encodings of the same character set, Unicode. UTF-8 encodings can be up to four bytes long (the original design allowed up to six), but the US-ASCII subset of Unicode needs only one byte for each character.
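This is easy to check in Python, whose str.encode reports the byte length of a character under each encoding:

```python
# UTF-8 and UTF-16 cover the same repertoire; only the byte counts differ.
for ch in ("A", "\u00fc", "\u4e2d"):  # ASCII, u-umlaut, a Chinese character
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
# A 1 2
# ü 2 2
# 中 3 2
```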
Bregalad
Posts: 149
Joined: 27 Mar 2010
Location: Chexbres, VD, Switzerland
Contact:

Re: Historical question about ASCII

Post by Bregalad »

Quote:
I don't believe UTF-8 supports many non-Latin character sets, such as traditional Chinese. In fact, I seem to recall that pairs of UTF-16 words may be used in such cases, resulting in 32 bits being passed per character.
UTF-8 fully supports traditional Chinese.
Quote:
Unicode, when encoded in UTF-8, and carrying ASCII, is no more bulky - unless you regard 8 bits as more than 7 bits, which you might!
UTF-7 comes to the rescue, then.
Post Reply