All times are UTC




Posted: Sun Aug 30, 2020 5:27 pm

Joined: Fri Nov 16, 2018 8:55 pm
Posts: 71
I have a segment of code that copies CHARGEN glyphs from ROM into RAM. This initialization code takes a couple of different forms. I'm trying to create an abstract macro to handle all of the cases.

Here is an example segment, one of several similar cases:

Code:
          ldx #$d0         ; +208 / 8 = 26, because we're copying bitmaps [A-Z].
B8038     lda $d007,x      ; Copy [A-Z] to RAM.
          not              ; alias for `eor #$ff .A` for readability, reverse video effect
          sta $c3ff,x      ; Bitmap storage area.
          dex              ; .X--
          bne B8038        ; BNE because we're copying > 127, else BPL.


Copying a small number of characters such as [0-9] loops downward until a negative number is reached. I'm not sure why the original coder didn't use BNE and a countdown. Someone correct me if I'm wrong but that would have resulted in identical branch instruction logic, right?
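To make the address arithmetic in the example segment concrete, here is a rough Python model of that loop. It's only a sketch: `copy_inverted` is a made-up name, memory is modelled as a dict, and the ROM contents are fake placeholder bytes rather than real CHARGEN data.

```python
def copy_inverted(rom, src_base=0xD007, dst_base=0xC3FF, count=0xD0):
    """Model of: ldx #count / lda src,x / eor #$ff / sta dst,x / dex / bne."""
    ram = {}
    x = count
    while True:
        ram[dst_base + x] = rom[src_base + x] ^ 0xFF  # lda, eor #$ff, sta
        x = (x - 1) & 0xFF                            # dex (8-bit wraparound)
        if x == 0:                                    # bne falls through at zero
            break
    return ram


# Placeholder ROM (assumption): each byte is just the low byte of its address.
rom = {addr: addr & 0xFF for addr in range(0xD000, 0xE000)}
ram = copy_inverted(rom)
first, last = min(ram), max(ram)
```

With these defaults the loop reads $D008-$D0D7 (26 glyphs of 8 bytes each) and writes $C400-$C4CF. Index 0 is never touched, which is why both base addresses are one less than the first byte actually copied.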

Anyway, what I'm trying to build is a very abstract macro, mainly for readability, that looks something like this:

Code:
copy_chr_bitmap (tuple of characters), $target_addr, [optional parameters]

So, in keeping with the previous example, the macro would look like this:

Code:
copy_chr_bitmap (A, Z), $c3ff, ('eor #$ff')


I can work that out, but I'm confused on one point. It seems that some characters repeat in different segments of the CHARGEN ROM and I'm really trying to make something that will be very flexible here. I've stepped through most of the CHARGEN ROM with the VICE memory monitor and the `mc` command. But, it's not always clear which bitmap maps to what character given the way PETSCII screen codes work. I can always add some modifiers to my macro lookups like a '^' prefix to indicate the uppercase character set or `[X]` to indicate reverse video, etc. It could be extended with a bit more logic to handle a tuple of specific glyphs and not just a range by checking the number of elements in the tuple, etc.

Can someone point me in the right direction? I checked Aivosto Oy's excellent PETSCII chart (PDF) but I don't see a segment that maps CHARGEN to glyphs, unless I missed something.

I realize this is kind of a range function meets regex-like functionality, which may seem oddly high level for an ASM macro. I still think I'll learn a lot about the 64tass macro engine in the process.


Posted: Mon Aug 31, 2020 4:15 am

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 730
Location: Tokyo, Japan
Well, you picked a good time to post this, since I've just been dealing with some moderately deep font vs. character set vs. character encoding issues for a Microsoft BASIC detokenizer. (Bastok, currently supporting MSX BASIC Japanese charset, in case you're interested.)

As an initial note: the whole encoding/charset/font distinction that I'm describing here may seem overly complex, but I've found that without this framework it's difficult to describe or even understand many problems related to the representation of characters.

Looking at your code sketches:
Code:
copy_chr_bitmap (tuple of characters), $target_addr, [optional parameters]
copy_chr_bitmap (A, Z), $c3ff, ('eor #$ff')

what you seem to want on the left are code points (or perhaps encoded code points), but what you're operating on are fonts. The same issue comes up in your textual comments:

Quote:
But, it's not always clear which bitmap maps to what character given the way PETSCII screen codes work.

PETSCII and screen codes are two separate things, by the way. In the terminology I explain below PETSCII is two coded character sets and two encodings (probably; CBM doesn't actually clearly define PETSCII) and screen codes are font glyph codes. There are mappings between these, but they are not even the same kind of thing at the conceptual level.

────────────────────────────────────────

Let's work through the pieces we have here. It's important to remember that this is all conceptual work on our part: computers just shovel numbers around and any meaning assigned to them is both context-dependent and outside the computer. When we say that the number $50 or $D0 or $10 means "capital letter P" (and all three of those do in various situations on the C64), that's entirely in our heads; the computer doesn't know or care about any of that.

The terms I use here are similar to those in the Unicode Glossary but with a few changes for clarity, especially with coded character sets (UC: "encoded character sets") vs. encodings (UC: "encoding forms"). Where I've tweaked the terminology, I may give the Unicode terminology as "(UC: term)" after my term, except where I feel it would create more confusion.

The core of the abstraction we generally use for dealing with text is the character set, which is simply a finite set of characters (UC: "abstract characters") that have certain semantic meanings, such as "capital letter P," "lower-case letter i," "digit 7," "lower-case Greek letter π," "line drawing vertical bar," etc. Very closely related to this is the coded character set where we assign an integer called a code point to each of these characters, allowing us to reference them more easily. These integers are "pure" mathematical integers (0, 1, 7, 12345, etc.); they are not "8-bit" or "16-bit" integers or anything like that.

When transmitting a character or characters in a particular character set between two machines, programs, functions in code, or whatever, we use an encoding (UC: "character encoding scheme") to give us a binary value that can be read and, via some fixed set of rules, give us a sequence of code points for characters in the coded character set we're using. This can be extremely simple, or moderately complex. For example:

  • Encode the ASCII coded character set (code points 0 through 127) as a sequence of seven-bit units sent down a serial line where each unit is interpreted as a binary unsigned integer corresponding exactly to the ASCII character code.
  • Encode the ASCII coded character set as a sequence of eight-bit bytes in memory where each value $00-$7F corresponds exactly to the ASCII character code, and where each value $80-$FF is an invalid encoding sequence that produces an error.
  • Encode the ASCII coded character set as a sequence of eight-bit bytes in memory where each value $80-$FF has $80 subtracted from it to produce the ASCII character code, and where each value $00-$7F is an invalid encoding sequence that produces an error. (This is used internally by the Apple II.)
  • Encode the Unicode coded character set (code points 0 through 1114111) as UTF-8. Each byte $00-$7F corresponds to the Unicode code point in that range. Code points 128 and above are encoded with multibyte sequences that are two or more bytes with the high bit set, as shown in this table.
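As a rough illustration of the list above, here is a small Python sketch of the byte-oriented encodings. The helper names (`ascii_code_points`, `encode_low`, `encode_high`, `decode_high`) are invented for this example; only `str.encode("utf-8")` is a real library call.

```python
def ascii_code_points(text):
    """Map characters to ASCII code points (plain integers 0..127)."""
    points = [ord(ch) for ch in text]
    if any(p > 127 for p in points):
        raise ValueError("character is not in the ASCII coded character set")
    return points


def encode_low(text):
    """Bytes $00-$7F equal the code point; $80-$FF would be invalid."""
    return bytes(ascii_code_points(text))


def encode_high(text):
    """High-bit-set variant: byte = code point + $80."""
    return bytes(p + 0x80 for p in ascii_code_points(text))


def decode_high(data):
    """Inverse of encode_high; bytes $00-$7F are an encoding error."""
    if any(b < 0x80 for b in data):
        raise ValueError("invalid encoding sequence")
    return "".join(chr(b - 0x80) for b in data)


low = encode_low("P")                # one byte, $50
high = encode_high("P")              # one byte, $D0
utf8_p = "P".encode("utf-8")         # one byte, same value as the code point
utf8_pi = "\u03c0".encode("utf-8")   # Greek small pi: two bytes, high bits set
```

The same code point (80, "capital letter P") comes out as $50 in one encoding and $D0 in another, which is the reason for keeping code points and their encoded forms conceptually separate.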

Now you can see how, for simpler encodings, confusion can arise between the encoded code points and the code points themselves. After all, for typical uses of ASCII, the bit sequence 1010000 and the code point $50 (hex) or 80 (decimal) are usually thought of as the same thing. But it's important to keep these conceptually separate because otherwise confusion can arise about what coded character set you're actually using, as we'll see below.

The last piece of the abstraction is fonts, which contain glyphs each identified by a glyph code. The glyph code doesn't necessarily bear any connection to a code point. For example, ASCII code point $50 ("upper case P") would be rendered using glyph code $10 in the CBM "graphics" font but glyph code $50 in the CBM "text" font. Glyph codes are also known as "screen codes" in CBM-speak, and since we're talking about CBM stuff here, I'll use that term from here on.

Note that the standard CBM fonts actually include only 128 characters, but two versions of every character: in the graphics font screen code $10 gives a glyph for "upper case P," and screen code $90 gives another glyph for "upper case P" in which the colours are inverted.

The Aivosto PETSCII reference you provided shows screen codes and their glyphs for both standard fonts on page 14. You may also find the page Commodore 64 screen codes useful; I think that that one makes it easier to see the differences and commonalities between the two fonts.

────────────────────────────────────────

Now that we have all this out of the way, the next step is to think about what exactly you're trying to use on the left-hand side of your copy_chr_bitmap routine.

The easiest thing to do would be to use screen codes. So for example, if a programmer wants to copy the glyphs for characters 'A' through 'Z', he would pass in screen codes $01 through $1A if using the graphics font and $41 through $5A if using the text font. (Or $81-$9A / $C1-$DA for the inverted renderings of characters 'A'-'Z'.)
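A minimal sketch of that screen-code arithmetic, assuming the inverted glyph for any screen code is simply that code with bit 7 set. The function name and the font labels are made up for illustration.

```python
def letter_screen_codes(font, inverted=False):
    """Screen codes for capital 'A'..'Z' in the named font."""
    base = {"graphics": 0x01, "text": 0x41}[font]
    if inverted:
        base |= 0x80  # inverted glyphs live 128 screen codes higher
    return range(base, base + 26)


graphics = letter_screen_codes("graphics")
text = letter_screen_codes("text")
graphics_inv = letter_screen_codes("graphics", inverted=True)
text_inv = letter_screen_codes("text", inverted=True)
```

Here `graphics` spans $01-$1A, `text` spans $41-$5A, and the inverted ranges are $81-$9A and $C1-$DA respectively.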

But it's looking as if you want the programmer to be able to use encoded characters. As soon as you go there, you have the issue of how you determine which character set is in use, how that character set is encoded, and how the code points in the character set map to glyphs in the font.

The encoding is probably best kept (at least initially) to bytes valued 0-255 that map directly to code points in a character set. That leaves you with the problem of choosing the character set and figuring out how that maps to the particular font that's being used.

If you don't mind restricting yourself to ASCII, which has only 95 printable characters, then programmers could pass in codes $20-$7E, you would have to figure out how these map to the particular character set in question (perhaps the programmer also chooses a mapping and passes that in), and then the rest seems easy.

You could use just one fixed PETSCII character set in the same way, but if you want to allow the use of both PETSCII character sets you now need a way for the programmer to specify which of the two character sets he's using. If you say that for each character set you can use only one font, your problem there is then more or less solved, but if the programmer should be able to pull characters from arbitrary fonts, the different mappings for each character set to the different fonts need to be dealt with. For example ("CP" below means "code point"):

Code:
            charset  char CP   font       screen code
            ---------------------------------------
               ASCII  P  $50   graphics     $10
               ASCII  P  $50   text         $50
    PETSCII graphics  P  $50   graphics     $10
    PETSCII graphics  P  $50   text         $50
        PETSCII text  P  $70   graphics     $10
        PETSCII text  P  $70   text         $50
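
One way to picture the table is as two separate lookups that meet at the abstract character: (charset, code point) to character, then (font, character) to screen code. This is just a sketch holding the rows shown above; the dictionary and function names are hypothetical.

```python
# (charset, character) -> code point, taken from the rows of the table
CODE_POINT = {
    ("ASCII", "P"): 0x50,
    ("PETSCII graphics", "P"): 0x50,
    ("PETSCII text", "P"): 0x70,
}

# (font, character) -> screen code, taken from the rows of the table
SCREEN_CODE = {
    ("graphics", "P"): 0x10,
    ("text", "P"): 0x50,
}


def char_for(charset, code_point):
    """Reverse lookup: which abstract character does this code point name?"""
    for (cs, ch), cp in CODE_POINT.items():
        if cs == charset and cp == code_point:
            return ch
    raise KeyError((charset, code_point))


def screen_code(charset, code_point, font):
    """Go from a code point to a screen code via the abstract character."""
    return SCREEN_CODE[(font, char_for(charset, code_point))]


ascii_in_graphics = screen_code("ASCII", 0x50, "graphics")
petscii_text_in_text = screen_code("PETSCII text", 0x70, "text")
```

Keeping the two tables apart means adding a new character set or a new font only touches one of them.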


Another option, if you want to be able to deal with arbitrary characters, would be to develop your own character set, coding all of the 163 characters in the union of both fonts and perhaps some control codes as well, all of which you should easily be able to fit in 256 code points. You might use the same code points as ASCII in the lower half, put CBM-specific control codes in $80-$9F, and use $A0-$FF for the 71 graphics characters.

Perhaps you can think through what you're trying to achieve, try to fit it into the model above, and then come back and explain it in terms of that model. From there we can work forward and try to figure out the API for doing that. (Or more likely, work out together what you're trying to achieve and whatever compromises might be made to simplify that while still getting done most of what you want done.)

_________________
Curt J. Sampson - github.com/0cjs


Posted: Mon Aug 31, 2020 6:53 pm

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
Here's a ranged table for converting petscii to screen codes: https://sta.c64.org/cbm64pettoscr.html Of course, this presumes that your macro can read the petscii code of the source code character passed into it. I'm not familiar with 64tass, but if that's a cross-assembler, then you have to figure out what it's doing with ascii/unicode source code text vs petscii as well.
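For reference, a ranged conversion of that kind might look like the Python sketch below. The ranges are the commonly published PETSCII-to-screen-code offsets as I understand them, so treat them as an assumption and check them against the linked table; the function name is made up.

```python
def petscii_to_screen(code):
    """Convert one PETSCII byte to a C64 screen code using ranged offsets."""
    if not 0 <= code <= 0xFF:
        raise ValueError("PETSCII code must fit in one byte")
    if code == 0xFF:
        return 0x5E           # special case at the very top
    if code < 0x20:
        return code + 0x80    # $00-$1F
    if code < 0x40:
        return code           # $20-$3F (space, digits, punctuation)
    if code < 0x60:
        return code - 0x40    # $40-$5F
    if code < 0x80:
        return code - 0x20    # $60-$7F
    if code < 0xA0:
        return code + 0x40    # $80-$9F
    if code < 0xC0:
        return code - 0x40    # $A0-$BF
    return code - 0x80        # $C0-$FE


a_code = petscii_to_screen(0x41)
p_code = petscii_to_screen(0x50)
```

Under these ranges PETSCII $41 becomes screen code $01 and $50 becomes $10, while $70 becomes $50.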

Instead of mussing with the eor #$ff, you should just implement reading from the reversed chars at the top half of the font. That should just add 1024 to the base pointer, which could be simpler than injecting code like that. Of course, selecting uppercase vs lowercase fonts is also just a 2048 offset, but means interpreting the input character differently as well. It might warrant a separate macro instead of a flag.
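The offsets above can be sketched in Python as follows. This assumes the CHARGEN ROM is visible at $D000 (as in the original snippet's `lda $d007,x`) and 8 bytes per glyph; the function and parameter names are hypothetical.

```python
CHARGEN_BASE = 0xD000  # assumption: ROM banked in at $D000


def glyph_address(screen_code, lowercase_font=False, base=CHARGEN_BASE):
    """Address of the first bitmap byte for a screen code."""
    if not 0 <= screen_code <= 0xFF:
        raise ValueError("screen code must fit in one byte")
    font_offset = 0x800 if lowercase_font else 0  # 2048 bytes per font
    return base + font_offset + screen_code * 8   # 8 bytes per glyph


def reversed_glyph_address(screen_code, lowercase_font=False):
    """Reversed glyphs are the same screen code with bit 7 set (+1024 bytes)."""
    return glyph_address(screen_code | 0x80, lowercase_font)


plain_a = glyph_address(0x01)
reversed_a = reversed_glyph_address(0x01)
```

Here `plain_a` is $D008, matching the first byte the original loop reads, and `reversed_a` is exactly 1024 bytes higher at $D408.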

But I think an elephant in the room is that if you want punctuation or graphical glyph ranges, passing a literal character like this isn't going to work too well. Consider what would happen if you tried to reference ")". I think it'd be better to just look it up externally and pass in numbers, commenting which characters are being referenced.

Quote:
Copying a small number of characters such as [0-9] loops downward until a negative number is reached. I'm not sure why the original coder didn't use BNE and a countdown. Someone correct me if I'm wrong but that would have resulted in identical branch instruction logic, right?

While it's a small number of characters, it's not a small number of bytes (208) which is already far into signed-negative range. But the code you pasted is already dex/bne, and does comment on the "> 127" issue, so I'm not 100% sure what you mean.
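To see the difference concretely, here is a small Python model of a `body / dex / branch` loop that counts how many times the body runs for a given starting X. It's a sketch only, and `iterations` is an invented helper.

```python
def iterations(start, branch):
    """Count passes through: body / dex / <branch> loop, starting with X=start."""
    x = start & 0xFF
    count = 0
    while True:
        count += 1                 # loop body runs once per pass
        x = (x - 1) & 0xFF         # dex with 8-bit wraparound
        if branch == "bne":
            taken = x != 0         # Z flag clear
        elif branch == "bpl":
            taken = x < 0x80       # N flag clear (bit 7 of X is 0)
        else:
            raise ValueError("branch must be 'bne' or 'bpl'")
        if not taken:
            return count


bne_208 = iterations(0xD0, "bne")  # full 208-byte copy
bpl_208 = iterations(0xD0, "bpl")  # exits after a single pass
bpl_small = iterations(0x4F, "bpl")
```

Starting from $D0, the BNE loop runs 208 times but the BPL loop exits after one pass, because X is still "negative" after the first DEX. For small counts the two forms differ only in whether index 0 is included: BPL from $4F and BNE from $50 both run 80 times.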

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


Posted: Mon Aug 31, 2020 10:03 pm

Joined: Fri Nov 16, 2018 8:55 pm
Posts: 71
White Flame wrote:
Here's a ranged table for converting petscii to screen codes: https://sta.c64.org/cbm64pettoscr.html Of course, this presumes that your macro can read the petscii code of the source code character passed into it. I'm not familiar with 64tass, but if that's a cross-assembler, then you have to figure out what it's doing with ascii/unicode source code text vs petscii as well.

...

I think it'd be better to just look it up externally and pass in numbers, commenting which characters are being referenced.


Yeah, 64tass is a cross-assembler and it has a very good UTF-8 to PETSCII mapping. I should be able to just use that if I have to. But, you're right, it's probably better if I just use start/stop screen codes for now. I can over-engineer things later. :)

Quote:
Instead of mussing with the eor #$ff, you should just implement reading from the reversed chars at the top half of the font.


Ahh, you spotted that too. Yeah, it's not my code. It's from an early C64 game that I'm walking back to functioning and serviceable source code. On my first pass through the code I'm not going to fix any bugs or less than optimal routines. For now the hash of the assembled file must match the hash of the original.

Eventually, I'll create a fork with fixes like that and other quality of life improvements to the code.

Quote:
load81 wrote:
Copying a small number of characters such as [0-9] loops downward until a negative number is reached. I'm not sure why the original coder didn't use BNE and a countdown. Someone correct me if I'm wrong but that would have resulted in identical branch instruction logic, right?

While it's a small number of characters, it's not a small number of bytes (208) which is already far into signed-negative range. But the code you pasted is already dex/bne, and does comment on the "> 127" issue, so I'm not 100% sure what you mean.


Yeah, sorry, I'm making several passes on a rather small section of code and learning as I go. Many of the comments are "first draft" kinds of comments that are subject to change once I fully grasp what I'm looking at. This is one of those times. You're correct, it's a signed (negative) integer according to 8-bit two's complement rules. The comment was intended to reflect that other similar routines, copying fewer bytes of bitmap, branch on plus. This one routine is BNE because that's the only viable option, without breaking up the routine, given 208 bytes in the sequence and the properties of two's complement.

