Well, you picked a good time to post this, since I've just been dealing with some moderately deep font vs. character set vs. character encoding issues for a Microsoft BASIC detokenizer. (Bastok, currently supporting MSX BASIC Japanese charset, in case you're interested.)
As an initial note: the whole encoding/charset/font distinction that I'm describing here may seem overly complex, but I've found that without this framework it's difficult to describe or even understand many problems related to the representation of characters.
Looking at your code sketches:
Code:
copy_chr_bitmap (tuple of characters), $target_addr, [optional parameters]
copy_chr_bitmap (A, Z), $c3ff, ('eor #$ff')
what you seem to want on the left are code points (or perhaps encoded code points), but what you're operating on are fonts. The same issue comes up in your textual comments:
Quote:
But, it's not always clear which bitmap maps to what character given the way PETSCII screen codes work.
PETSCII and screen codes are two separate things, by the way. In the terminology I explain below PETSCII is two coded character sets and two encodings (probably; CBM doesn't actually clearly define PETSCII) and screen codes are font glyph codes. There are mappings between these, but they are not even the same kind of thing at the conceptual level.
────────────────────────────────────────
Let's work through the pieces we have here. It's important to remember that this is all conceptual work on our part: computers just shovel numbers around and any meaning assigned to them is both context-dependent and outside the computer. When we say that the number $50 or $D0 or $10 means "capital letter P" (and all three of those do in various situations on the C64), that's entirely in our heads; the computer doesn't know or care about any of that.
The terms I use here are similar to those in the Unicode Glossary but with a few changes for clarity, especially with coded character sets (UC: "encoded character sets") vs. encodings (UC: "encoding forms"). Where I've tweaked the terminology, I may give the Unicode terminology as "(UC: term)" after my term, except where I feel it would create more confusion.
The core of the abstraction we generally use for dealing with text is the character set, which is simply a finite set of characters (UC: "abstract characters") that have certain semantic meanings, such as "capital letter P," "lower-case letter i," "digit 7," "lower-case greek letter π," "line drawing vertical bar," etc. Very closely related to this is the coded character set, where we assign an integer called a code point to each of these characters, allowing us to reference them more easily. These integers are "pure" mathematical integers (0, 1, 7, 12345, etc.); they are not "8-bit" or "16-bit" integers or anything like that.
When transmitting a character or characters in a particular character set between two machines, programs, functions in code, or whatever, we use an encoding (UC: "character encoding scheme") to give us a binary value that can be read and, via some fixed set of rules, give us a sequence of code points for characters in the coded character set we're using. This can be extremely simple or moderately complex. For example:
- Encode the ASCII coded character set (code points 0 through 127) as a sequence of seven-bit units sent down a serial line, where each unit is interpreted as an unsigned binary integer corresponding exactly to the ASCII character code.
- Encode the ASCII coded character set as a sequence of eight-bit bytes in memory where each value $00-$7F corresponds exactly to the ASCII character code, and where each value $80-$FF is an invalid encoding sequence that produces an error.
- Encode the ASCII coded character set as a sequence of eight-bit bytes in memory where each value $80-$FF has $80 subtracted from it to produce the ASCII character code, and where each value $00-$7F is an invalid encoding sequence that produces an error. (This is used internally by the Apple II.)
- Encode the Unicode coded character set (code points 0 through 1114111) as UTF-8. Each byte $00-$7F corresponds to the Unicode code point in that range. Code points 128 and above are encoded with multibyte sequences that are two or more bytes with the high bit set, as shown in this table.
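A quick Python sketch of the second and third examples above (the function names are mine, not from any real library) shows how two different encodings can carry the very same coded character set:

```python
# Sketch only: two decoders for the same coded character set (ASCII).
# An encoding is a rule mapping stored byte values to code points;
# the byte values and the code points are conceptually distinct things.

def decode_plain_ascii(data: bytes) -> list[int]:
    """Bytes $00-$7F map directly to ASCII code points; $80-$FF are errors."""
    for b in data:
        if b > 0x7F:
            raise ValueError(f"invalid byte ${b:02X} in plain ASCII encoding")
    return list(data)

def decode_apple2_ascii(data: bytes) -> list[int]:
    """Apple II internal form: $80-$FF have $80 subtracted; $00-$7F are errors."""
    for b in data:
        if b < 0x80:
            raise ValueError(f"invalid byte ${b:02X} in high-bit ASCII encoding")
    return [b - 0x80 for b in data]

# Different stored bytes, same code point ("capital letter P"):
assert decode_plain_ascii(b"\x50") == decode_apple2_ascii(b"\xD0") == [0x50]
```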
Now you can see how, for simpler encodings, confusion can arise between the encoded code points and the code points themselves. After all, for typical uses of ASCII, the bit sequence 1010000 and the code point $50 (hex) or 80 (decimal) are usually thought of as the same thing. But it's important to keep these conceptually separate because otherwise confusion can arise about what coded character set you're actually using, as we'll see below.
The last piece of the abstraction is fonts, which contain glyphs, each identified by a glyph code. The glyph code doesn't necessarily bear any connection to a code point. For example, ASCII code point $50 ("upper case P") would be rendered using glyph code $10 in the CBM "graphics" font but glyph code $50 in the CBM "text" font. Glyph codes are also known as "screen codes" in CBM-speak, and since we're talking about CBM stuff here, I'll use that term from here on.
Note that the standard CBM fonts actually include only 128 characters, but two versions of every character: in the graphics font screen code $10 gives a glyph for "upper case P," and screen code $90 gives another glyph for "upper case P" in which the colours are inverted.
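To make the code-point/glyph-code split concrete, here's a hypothetical Python table covering just the one character discussed above (the names and structure are mine, purely for illustration):

```python
# Hypothetical sketch: a glyph code is a font-local index with no inherent
# relation to any code point. Only "upper case P" is filled in here.

GLYPH_CODE = {
    "graphics": {0x50: 0x10},  # code point $50 "P" -> screen code $10
    "text":     {0x50: 0x50},  # same character, different glyph code
}

def screen_code(font: str, code_point: int, inverse: bool = False) -> int:
    glyph = GLYPH_CODE[font][code_point]
    return glyph | 0x80 if inverse else glyph  # the inverted glyph sits at +$80
```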
The Aivosto PETSCII reference you provided shows screen codes and their glyphs for both standard fonts on page 14. You may also find the page Commodore 64 screen codes useful; I think that one makes it easier to see the differences and commonalities between the two fonts.
────────────────────────────────────────
Now that we have all this out of the way, the next step is to think about what exactly you're trying to use on the left-hand side of your copy_chr_bitmap routine.
The easiest thing to do would be to use screen codes. So for example, if a programmer wants to copy the glyphs for characters 'A' through 'Z', he would pass in screen codes $01 through $1A if using the graphics font and $41 through $5A if using the text font. (Or $81-$9A / $C1-$DA for the inverted renderings of characters 'A'-'Z'.)
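As a Python sketch of that arithmetic (the function name is hypothetical; the base screen codes are the standard CBM ones, and inverted glyphs sit at screen code + $80):

```python
# Sketch: the screen-code ranges a caller would pass for 'A'-'Z',
# per font and video mode.

def az_screen_codes(font: str, inverse: bool = False) -> range:
    base = {"graphics": 0x01, "text": 0x41}[font]  # screen code of 'A'
    if inverse:
        base += 0x80  # inverted glyphs live in the upper half of the font
    return range(base, base + 26)  # 26 letters, 'A' through 'Z'
```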
But it's looking as if you want the programmer to be able to use encoded characters. As soon as you go there, you have the issue of how you determine which character set is in use, how that character set is encoded, and how the code points in the character set map to glyphs in the font.
The encoding is probably best kept (at least initially) to bytes valued 0-255 that map directly to code points in a character set. That leaves you with the problem of choosing the character set and figuring out how that maps to the particular font that's being used.
If you don't mind restricting yourself to ASCII, which has only 95 printable characters, then programmers could pass in codes $20-$7E; you would have to figure out how these map to the particular font in question (perhaps the programmer also chooses a mapping and passes that in), and then the rest seems easy.
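As a sketch of that ASCII-only approach in Python (the function name and the `cp_to_glyph` parameter are hypothetical stand-ins for whatever mapping the programmer supplies):

```python
# Sketch: validate printable ASCII, then map each code point to a glyph
# via a caller-supplied mapping.

def ascii_to_screen_codes(data: bytes, cp_to_glyph) -> list[int]:
    out = []
    for b in data:
        if not 0x20 <= b <= 0x7E:
            raise ValueError(f"${b:02X} is outside printable ASCII")
        out.append(cp_to_glyph(b))  # caller decides the code-point->glyph map
    return out

# With the graphics font, ASCII 'P' ($50) renders via screen code $10:
assert ascii_to_screen_codes(b"P", lambda cp: {0x50: 0x10}[cp]) == [0x10]
```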
You could use just one fixed PETSCII character set in the same way, but if you want to allow the use of both PETSCII character sets, you now need a way for the programmer to specify which of the two character sets he's using. If you say that for each character set you can use only one font, your problem there is then more or less solved; but if the programmer should be able to pull characters from arbitrary fonts, the different mappings from each character set to the different fonts need to be dealt with. For example ("CP" below means "code point"):
Code:
charset           char  CP   font      screen code
--------------------------------------------------
ASCII              P    $50  graphics  $10
ASCII              P    $50  text      $50
PETSCII graphics   P    $50  graphics  $10
PETSCII graphics   P    $50  text      $50
PETSCII text       P    $70  graphics  $10
PETSCII text       P    $70  text      $50
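That table can be transcribed directly as a lookup keyed on all three inputs, e.g. in Python (the structure and names are mine; the data is just the table above):

```python
# The (charset, code point, font) -> screen code table, as a Python dict.
# Note that the same character needs different keys depending on which
# coded character set the caller is working in.

SCREEN_CODE = {
    ("ASCII",            0x50, "graphics"): 0x10,
    ("ASCII",            0x50, "text"):     0x50,
    ("PETSCII graphics", 0x50, "graphics"): 0x10,
    ("PETSCII graphics", 0x50, "text"):     0x50,
    ("PETSCII text",     0x70, "graphics"): 0x10,
    ("PETSCII text",     0x70, "text"):     0x50,
}
```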
Another option, if you want to be able to deal with arbitrary characters, would be to develop your own character set, coding all of the 163 characters in the union of both fonts and perhaps some control codes as well, all of which you should easily be able to fit in 256 code points. You might use the same code points as ASCII in the lower half, put CBM-specific control codes in $80-$9F, and use $A0-$FF for the 71 graphics characters.
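A minimal Python sketch of that layout, with ranges only (the actual character assignments within each range would still need to be worked out):

```python
# Sketch of one possible layout for a custom 8-bit coded character set,
# following the split suggested above.

LAYOUT = [
    (0x00, 0x7F, "same code points as ASCII"),
    (0x80, 0x9F, "CBM-specific control codes"),
    (0xA0, 0xFF, "CBM graphics characters"),   # 96 slots for 71 characters
]

def region(cp: int) -> str:
    for lo, hi, name in LAYOUT:
        if lo <= cp <= hi:
            return name
    return "unassigned"

assert region(0x41) == "same code points as ASCII"
assert region(0xA0) == "CBM graphics characters"
```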
Perhaps you can think through what you're trying to achieve, try to fit it into the model above, and then come back and explain it in terms of that model. From there we can work forward and try to figure out the API for doing that. (Or more likely, work out together what you're trying to achieve and whatever compromises might be made to simplify that while still getting done most of what you want done.)