6502.org Forum

All times are UTC




PostPosted: Sun Mar 29, 2020 6:29 pm
Joined: Fri Nov 16, 2018 8:55 pm
Posts: 71
I've finally returned to my "learning 6502 via disassembly and reverse engineering a cartridge game" project. My goal is to learn 6502 Assembly and be able to reliably reconstruct working source code to stand in for sources that have been lost to time.

This has been my approach in a nutshell:

  1. Get a hex dump of the file that is one byte wide. I used xxd for this.
  2. Use a counter with bash and sed to place the memory offset (file location + start of execution) in a comment off to the right.
  3. Prefix all hex with .byte instructions, so that 64tass can theoretically assemble it as-is.
  4. Create a center column of comments that is my manual disassembly.
  5. When I'm done (I'm about 1/10th of the way through), awk the file so that my disassembly and offset comments are the only things left.
  6. Build that.
  7. Find and fix errors until the whole thing assembles, runs, and has the same hash as the original file.
  8. Trace all of the branches and jumps and figure out what's really going on at those memory locations.
  9. Go back and make macros and functions where it makes sense to do so.
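
Steps 1 to 3 could also be done in a single pass. This is a minimal Python sketch, assuming the code bytes have already been read from the file and the load address is known. `byte_listing` is a name invented for this illustration; it is not part of xxd or 64tass.

```python
def byte_listing(data: bytes, load_address: int) -> list[str]:
    """Emit one .byte directive per input byte, with its address as a comment."""
    lines = []
    for offset, value in enumerate(data):
        address = load_address + offset
        # The empty middle comment column is where the manual disassembly goes.
        lines.append(f"   .byte ${value:02x}      ;              ;${address:04X}")
    return lines


if __name__ == "__main__":
    # Sample bytes: sei followed by jsr $ff84, loaded at $8009.
    sample = bytes([0x78, 0x20, 0x84, 0xFF])
    print("\n".join(byte_listing(sample, 0x8009)))
```

Keeping the address in the last column means the listing can still be assembled as-is while the middle column is filled in by hand.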

In case that's not clear, here are the first few bytes of disassembly as a visual reference. This is the first executable instruction forward, once you get past the CRT header, CHIP segments, vectors, and "CBM80" magic number, etc.:

Code:
   .byte $78      ; sei          ;$8009
   .byte $20      ; jsr $ff84    ;$800A
   .byte $84      ; op           ;$800B
   .byte $ff      ; op           ;$800C
   .byte $20      ; jsr $ff87    ;$800D
   .byte $87      ; op           ;$800E
   .byte $ff      ; op           ;$800F
   .byte $20      ; jsr $ff8a    ;$8010
   .byte $8a      ; op           ;$8011
   .byte $ff      ; op           ;$8012
   .byte $20      ; jsr $ff81    ;$8013


Getting this far required some research on my part, because it wasn't immediately obvious that the first byte after the header was anything other than an instruction. My first clue that I had done something wrong was that, several instructions in, things stopped making sense and I started hitting invalid opcodes.

Even though this is tedious as hell, I'm learning a lot just by doing it. While I could use the likes of radare2 to disassemble this for me, going through it by hand is teaching me a lot. In an effort to better understand what I'm seeing I'm also reading C64 Machine Language for the Absolute Beginner by Danny Davis. Some of Mansfield's stuff is next.

My questions are:

  1. In terms of reverse engineering process/workflow and bearing in mind that my end product is reconstructed source code, am I doing this right?
  2. At a certain point, I want to do this to fairly large multi-disk C64 programs. At what point (if any) is it best to partially automate disassembly?
  3. Is anybody else doing this sort of thing in the C64 space? It seems more common among the Atari 2600 crowd.
  4. Is there a good "6502 ASM style guide" of sorts anywhere? I want the end result to be readable so that even a very fresh student of ASM can begin to follow it?
  5. Is there an unofficial "standard library" of sorts for 64 projects that will work with 64tass? I know a lot of the A2600-types use a consistent set of headers for dasm.
  6. What's with the first instruction being sei? I very dimly recall reading about this as a common thing to do for cartridges, but I don't know why this is done.

Any other constructive or useful feedback is welcome. I've just always wanted to know how this worked at the lowest possible level. I'm sure that this will make me a better programmer even outside of an assembly language context.


Last edited by load81 on Tue Mar 31, 2020 12:21 pm, edited 1 time in total.

PostPosted: Sun Mar 29, 2020 8:15 pm
Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1929
Location: Sacramento, CA, USA
I haven't explored White Flame's WFDis yet, but I believe there is some inspiration to be possibly gained there. He's a generous and talented individual, and may even be here shortly to share some insights that are surely beyond my experience and ability.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


PostPosted: Sun Mar 29, 2020 8:58 pm
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3354
Location: Ontario, Canada
load81 wrote:
am I doing this right?
Remember you'll need to develop an understanding of the I/O that's installed -- the addresses where IO registers appear, and the functions of the bits in the regs. Have fun, and keep us posted. :)

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


PostPosted: Sun Mar 29, 2020 9:17 pm
Joined: Fri Nov 16, 2018 8:55 pm
Posts: 71
You're right. It's a standard C64 cartridge, so it starts as $C000 and runs for ~8k. So, there isn't any disk I/O to worry about. That's part of why I picked a cartridge. I'll worry about disk I/O on a future project.

My main areas of interest are obviously going to be:

  • Everything the program is doing in Zero Page and Page 1 (the stack).
  • Branches and jumps inside of its memory space.
  • Every KERNAL call it makes.

Anything in ZP is "important" and is easy to spot. But figuring out exactly what is going on is very likely going to be easier said than done. Thankfully, the KERNAL jump table entries ($FF81-$FFF3) are both easy to spot and well documented. I'm sure I'll need to build a diagram of memory locations, what calls them, and what gets called in order to begin to "grok" even a mere 8 kB of code.


PostPosted: Mon Mar 30, 2020 5:31 am
Joined: Thu May 28, 2009 9:46 pm
Posts: 8182
Location: Midwestern USA
load81 wrote:
You're right. It's a standard C64 cartridge, so it starts as $C000 and runs for ~8k.

You sure about that start address? See the following excerpt from Mapping the Commodore 64.

Quote:
Two lines located on the expansion port are called GAME and EXROM. When used in conjunction with the software-controlled lines noted above, these two hardware lines can enable cartridge ROM to replace various segments of ROM and/or RAM.

Possible configurations include 8K of cartridge ROM to be switched in at $8000-$9FFF, for a BASIC enhancement program; an 8K cartridge ROM at $A000-$BFFF, replacing BASIC, or at $E000-$FFFF, replacing the Kernal, or a 16k cartridge at $8000-$C000.

Note that last address range is slightly in error. A 16K cartridge can extend from $8000-$BFFF. The C64 always reserves $C000-$CFFF as RAM.

Also note the following:

Quote:
32768 $8000
Autostart ROM Cartridge

An 8K or 16K autostart ROM cartridge designed to use this as a starting memory address may be plugged into the Expansion Port on the back. If the cartridge ROM at locations 32772-32776 ($8004-$8008) contains the numbers 195, 194, 205, 56, 48 ($C3, $C2, $CD, $38, $30) when the computer powers up, it will start the program pointed to by the vector at locations 32768-32769 ($8000-$8001), and will use 32770-32771 ($8002-$8003) for a warm start vector when the RESTORE key is pressed. These characters are PETASCII for the inverse letters CBM, followed by the digits 80. An autostart cartridge may also be addressed at 40960 ($A000), where it would replace BASIC, or at 61440 ($F000), where it would replace the Kernal.

It is possible to have a 16K cartridge sitting at 32768 ($8000), such as Simon's BASIC, which can be turned on and off so that the BASIC ROM underneath can also be used. Finally, it is even possible to have bank-selected cartridges, which turn banks of memory in the cartridge on and off alternately, so that a 32K program could fit into only 16K of addressing space.
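
The autostart check described in the excerpt can be sketched in a few lines of Python. This assumes `rom` holds the raw ROM contents as mapped at $8000, with any emulator container header already stripped. `parse_autostart_header` is a name invented for this illustration, and the sample bytes in the usage lines are made up.

```python
CBM80 = bytes([0xC3, 0xC2, 0xCD, 0x38, 0x30])  # signature at $8004-$8008


def parse_autostart_header(rom: bytes) -> dict:
    """Read the two vectors and the signature from a ROM image mapped at $8000."""
    if len(rom) < 9:
        raise ValueError("image too short to hold an autostart header")
    return {
        "cold_start": int.from_bytes(rom[0:2], "little"),  # $8000-$8001
        "warm_start": int.from_bytes(rom[2:4], "little"),  # $8002-$8003
        "autostart": rom[4:9] == CBM80,
    }


if __name__ == "__main__":
    # Made-up header: both vectors are invented values for the example.
    sample = bytes([0x09, 0x80, 0x0C, 0x80]) + CBM80 + bytes([0x78])
    header = parse_autostart_header(sample)
    print(f"cold start ${header['cold_start']:04X}, "
          f"warm start ${header['warm_start']:04X}, "
          f"autostart signature present: {header['autostart']}")
```

The cold start vector is the natural first entry point to hand to a disassembler, and the nine header bytes should be treated as data rather than code.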

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Mon Mar 30, 2020 2:37 pm
Joined: Fri Nov 16, 2018 8:55 pm
Posts: 71
Quote:
You sure about that start address? See the following excerpt from Mapping the Commodore 64.


That $C000 value was a typo. Everything starts at $8000 with a pair of jump tables and the 'CBM80' magic number. The first real instruction byte is at offset $8009, the sei instruction I mentioned earlier. There is an unrelated section of code that jumps to $C000, but that comes much later.

I sanity checked my disassembly against the VICE machine language monitor to be sure I hadn't fat-fingered anything else nearby. Everything in and around $8000 checks out.


PostPosted: Tue Mar 31, 2020 1:05 am
Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas
load81 wrote:
This has been my approach in a nutshell:


My approach is:

  1. Make a copy of the binary image to a different name.
  2. Create a source file consisting of only .byte, fcb or db directives. Put 8 or 16 bytes per line with the starting address of each line in the comment field like you do. I usually write a quick program to do this.
  3. Edit and assemble the file. Iterate until the result matches the original binary.
  4. Starting with known entry points, replace bytes with actual assembly language instructions. I usually capture disassembly within a debugger and replace the corresponding bytes in the source file.
  5. For branches and other references to addresses, create a label at the appropriate spot and use that label in the instruction.
  6. Do that for one block of code at a time.
  7. Assemble that and iterate until the binaries match again.
  8. Analyze the code and try to identify tables and text strings.
  9. Comment liberally everything, including hunches.
  10. Repeat for additional blocks of code.
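
Steps 3 and 7 both come down to a byte-for-byte comparison of the rebuilt image against the original. This is a minimal Python sketch of that check, not the quick program mentioned in step 2. `first_mismatch` is a name invented for this illustration.

```python
from typing import Optional


def first_mismatch(original: bytes, rebuilt: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(original, rebuilt)):
        if a != b:
            return offset
    if len(original) != len(rebuilt):
        # One image is a prefix of the other; they diverge where the shorter ends.
        return min(len(original), len(rebuilt))
    return None


if __name__ == "__main__":
    # Made-up images: the rebuilt one has a wrong operand byte at offset 2.
    original = bytes([0x78, 0x20, 0x84, 0xFF])
    rebuilt = bytes([0x78, 0x20, 0x87, 0xFF])
    offset = first_mismatch(original, rebuilt)
    print("match" if offset is None else f"first difference at offset ${offset:04X}")
```

Adding the load address to the reported offset gives the address to look up in the source file comments, which is quicker than comparing hashes alone.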

An actual example:

Code:
 org $A100

A100
 bra A113

 fcb 1 ; Version

A103
 fcb             $00,$00,$00,$00,$00 ; A100
 fcb $00,$00,$00                     ; A108

A10B
 fcb             $00,$00,$00         ; A108

A10E fcb 0
A10F fcb 0
A110 fcb 0
A111 fdb 0

A113
 ldx $AC14 ; Line buffer pointer
 stx A111

 jsr $AD24 ; PCRLF

A11C
 tst A10F
 beq A13A

A121
 ldx #$A840 ; Find next drive
 ldaa #$14
 staa $00,X
 jsr $B406



load81 wrote:
My questions are:

  2. At a certain point, I want to do this to fairly large multi-disk C64 programs. At what point (if any) is it best to partially automate disassembly?


There are disassemblers available for many platforms. I have never used them even though they promise to automate much of this.

load81 wrote:
  3. Is anybody else doing this sort of thing in the C64 space? It seems more common among the Atari 2600 crowd.


I have never done it for the C64. The only 6502 code I have analyzed like this is the firmware for Gold Wings, a pinball game.

load81 wrote:
  4. Is there a good "6502 ASM style guide" of sorts anywhere? I want the end result to be readable so that even a very fresh student of ASM can begin to follow it?


Style varies tremendously from person to person and can become quite controversial. Read some code, pick one you like and stick to it.

load81 wrote:
  6. What's with the first instruction being sei? I very dimly recall reading about this as a common thing to do for cartridges, but I don't know why this is done.


Before doing something you do not want interrupted, do not assume; disable interrupts explicitly.

load81 wrote:
Any other constructive or useful feedback is welcome. I've just always wanted to know how this worked at the lowest possible level. I'm sure that this will make me a better programmer even outside of an assembly language context.


You cannot help but learn programming tricks when analyzing others' code.


PostPosted: Tue Mar 31, 2020 1:30 am
Joined: Fri Nov 16, 2018 8:55 pm
Posts: 71
Thank you for taking the time to share your entire process and explaining sei. Very cool.

I see a decent amount of similarity between your approach and mine. I don't want to split the code into multiple files yet, as you do, but eventually I will. You're right, I can't help but learn stuff analyzing code.


PostPosted: Tue Mar 31, 2020 4:23 am
Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
Just one note before I start in: "CRT" very frequently means "cathode ray tube monitor" in the computing community, and I've rarely or never heard "CRT" used for cartridge. You might want to change the title of your initial post to make it easier for those browsing the board to know what this thread is about. (From the title, I had assumed that this was about hacking on an old video monitor.)

load81 wrote:
At a certain point, I want to do this to fairly large multi-disk C64 programs. At what point (if any) is it best to partially automate disassembly?

In my experience, right at the very start. Basic disassembly is mechanical and, for large amounts of code, very tedious, which is exactly the kind of job that computers are suited for. A disassembler will also identify locations that are the targets of jumps and create labels for them, which is very handy since you can look through that list of labels for entry points at which to start segments of disassembly. (It will also mis-label addresses that look like targets but are actually the result of incorrectly disassembling data; these you want to clean up fairly early on in the process.)
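
To make the label-generation step concrete, here is a toy Python sketch of a linear sweep that collects jump and branch targets. It is not how da65 or any other real disassembler is implemented: the opcode table covers only a handful of instructions, the sweep simply stops at the first opcode it does not know, and `collect_targets` and the `Lxxxx` label scheme are invented for this illustration.

```python
# Instruction lengths for a deliberately tiny subset of 6502 opcodes.
LENGTHS = {
    0x78: 1,  # sei
    0xD8: 1,  # cld
    0x9A: 1,  # txs
    0x60: 1,  # rts
    0xA9: 2,  # lda #imm
    0xA2: 2,  # ldx #imm
    0x85: 2,  # sta zp
    0xD0: 2,  # bne rel
    0xF0: 2,  # beq rel
    0x20: 3,  # jsr abs
    0x4C: 3,  # jmp abs
}
ABSOLUTE_JUMPS = {0x20, 0x4C}
RELATIVE_BRANCHES = {0xD0, 0xF0}


def collect_targets(code: bytes, origin: int) -> dict[int, str]:
    """Linear sweep from origin; return {target address: label name}."""
    targets = set()
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        length = LENGTHS.get(opcode)
        if length is None or pc + length > len(code):
            break  # unknown opcode or truncated instruction: stop the sweep
        if opcode in ABSOLUTE_JUMPS:
            targets.add(code[pc + 1] | (code[pc + 2] << 8))
        elif opcode in RELATIVE_BRANCHES:
            offset = code[pc + 1]
            if offset >= 0x80:
                offset -= 0x100  # operand is a signed 8-bit displacement
            targets.add((origin + pc + 2 + offset) & 0xFFFF)
        pc += length
    return {addr: f"L{addr:04X}" for addr in sorted(targets)}


if __name__ == "__main__":
    # sei; jsr $ff84; jsr $ff87, assembled at $8009.
    code = bytes([0x78, 0x20, 0x84, 0xFF, 0x20, 0x87, 0xFF])
    for address, label in collect_targets(code, 0x8009).items():
        print(f"{label} = ${address:04X}")
```

If the sweep runs over data by mistake, it will happily report bogus targets, which is exactly the mis-labelling problem mentioned above.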

So what I normally do is start with a disassembler that takes a binary file and generates a text file containing the disassembly. (I don't generally use graphical tools; I've found that the hassle they introduce, especially when it comes to their custom data files and interfacing with a build system and so on isn't worth it.) These disassemblers generally take an annotation file as a second input where you can specify what areas to treat as code vs. data and so on.

I start with annotations for the start address of the disassembly, the CPU's system vectors (reset, IRQ, etc.) if those addresses are present, and any other entry points I know about (e.g., the C64 cartridge entry point, in your case). I then examine the disassembly and work from there, adding annotations for obvious data areas, examining the disassemblies for jump targets as they appear to see if they look sensible or not, and so on. I also start assigning my own label names at this point, though these are usually still tentative, and adding comment annotations to have the disassembler insert blank lines between major blocks. (I may start adding actual comments about the structure of the code, too.)

Once this is done I have in the past moved forward with further extensive annotations, trying to continue using the automated disassembler as long as possible, but I'm finding that the annotation file formats and the control they give me over the appearance of the disassembly output are not all that great. So I think I'm going to switch to starting sooner, rather than later, with changing from the disassembly being a target file (the output of the disassembler) to a source file, and no longer using the disassembler tool from that point on. Before doing this change you want of course to add your build system to build an output binary from the source and confirm that it matches the original input binary.

All of this is done in a Git repo, where I keep the source binary (unless there are copyright concerns), all my scripts and the tools I've written, often submodules for disassembly tools (for which I write appropriate build scripts), and usually the target disassembly during the automated stage to make it easier for someone browsing GitHub or GitLab to just go and read it. At the top level I always include a script called Test that checks out and builds any necessary tools, runs the disassembly (if still at that stage), builds the source, and compares the assembled output to the original binary. A developer with a fresh clone of the repo should simply be able to type ./Test at the top level of his checkout and end up with either a working system ready for further development or some error messages indicating fairly clearly what's wrong.

You can see some examples of this in the various repos in the retroabandon group on GitLab. The one that's seen the most work is the Retrocomputing SE Mystery Board Disassembly, which is 6809 code but otherwise no different from doing a 6502 disassembly. I'd suggest walking through the commits from beginning to end (git log --reverse --stat --patch): this well documents, in the commit messages and diffs, my journey from a raw ROM dump for a machine I knew almost nothing about to a mostly-complete disassembly, including the tools I built and used. (This will probably take a few hours, but from it you can learn much more quickly a lot of what took me many dozens of hours of work to learn.)

Quote:
Is anybody else doing this sort of thing in the C64 space? It seems more common among the Atari 2600 crowd.

Well, I do it for all sorts of machines, including C64. You can have a look at the start of a disassembly of the Epyx Fastload cartridge that a friend and I started a while back; it's still in the very early stages (and at least temporarily abandoned, at this point), but you might find the infrastructure and tools there useful. (We use da65 from the cc65 suite to disassemble the code.) Again, it will probably be more comprehensible if you walk forward through the commits, since that's mainly where we documented what the various files are and so on.

Quote:
Is there a good "6502 ASM style guide" of sorts anywhere? I want the end result to be readable so that even a very fresh student of ASM can begin to follow it?

There's no such thing as one "good style": as with any other kind of writing, good style varies (often drastically) depending on the target audience, and improving code presentation for one audience often makes it worse for others. (I go into this in some detail in this answer on the Software Engineering StackExchange.)

Take, for example, this 6502 routine to convert an ASCII character to a binary number. It's only nine instructions, and not particularly tricky by the standards of an expert 6502 programmer. I annotate these 19 words (labels, ops and operands) of code with over 400 words of commentary on how they work because this particular routine is one I use as an example for programmers not fairly experienced with 6502 assembly and/or tricky bit manipulation. But all that verbosity makes it harder for even me to read, and if this were code in a project I was working on with experienced 6502 coders such as Dr. Jefyl or chromatix, I would most certainly trim all that verbosity down to just some key hints about what's going on, because that would make it easier for them to read (obviously at the expense of the current target audience).

So when you're writing code and considering the style, ask yourself "Who will be reading this and what do they already know?" Try to tell them what they don't know, but also try to avoid telling them obvious things that they do know because that will just make it harder for them to read.

Quote:
Is there an unofficial "standard library" of sorts for 64 projects that will work with 64tass? I know a lot of the A2600-types use a consistent set of headers for dasm.

I don't know of one (because I've never used 64tass), but I do suggest that you use names and symbols from the C64 KERNAL source and popular disassemblies so that people familiar with those will more quickly understand your code. The standard library that comes with cc65 does this kind of thing.

Quote:
What's with the first instruction being sei? I very dimly recall reading about this as a common thing to do for cartridges, but I don't know why this is done.

Most likely the cart is doing some setup of the hardware, and it's important when touching anything that could generate interrupts that interrupts are not generated when you are not yet prepared to handle them. In the normal situation the C64 KERNAL ROM has already disabled interrupts and decimal mode and set up the stack, but the cartridge code doesn't know that this has actually run: perhaps someone took over the machine, set decimal mode, enabled interrupts and put the stack pointer at $00 before doing a JMP ($8000).

Sure, you could say, "well, we don't support that situation" and that wouldn't be unreasonable. But often (as in this case) it's easier for you as the developer simply to handle pathological situations in a nice way rather than breaking, because in the end you're usually going to be the one debugging any problems that arise. (Even if you say, "we don't support that," confirming that the situation is indeed that one you don't support can be many hours of debugging.)

So most cartridges start with the standard 6502 setup-from-reset code that I mention in my notes on CBM cartridge startup. Here, for example, is the code from the Epyx Fast Load cartridge:

Code:
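008030  1               coldstart:
008030  1  78                   sei                     ; disable IRQs
008031  1  D8                   cld                     ; clear decimal mode
008032  1  A2 FF                ldx     #$FF
008034  1  9A                   txs                     ; stack pointer = $FF
008035  1  A9 27                lda     #$27
008037  1  85 01                sta     PIO             ; $01: 6510 processor port
008039  1  A9 2F                lda     #$2F
00803B  1  85 00                sta     $00             ; $00: 6510 data direction register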
...

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Wed Apr 29, 2020 2:19 am
Joined: Fri Nov 16, 2018 8:55 pm
Posts: 71
cjs wrote:
So what I normally do is start with a disassembler that takes a binary file and generates a text file containing the disassembly. (I don't generally use graphical tools; I've found that the hassle they introduce, especially when it comes to their custom data files and interfacing with a build system and so on isn't worth it.) These disassemblers generally take an annotation file as a second input where you can specify what areas to treat as code vs. data and so on.

I start with annotations for the start address of the disassembly, the CPU's system vectors (reset, IRQ, etc.) if those addresses are present, and any other entry points I know about (e.g., the C64 cartridge entry point, in your case). I then examine the disassembly and work from there, adding annotations for obvious data areas, examining the disassemblies for jump targets as they appear to see if they look sensible or not, and so on. I also start assigning my own label names at this point, though these are usually still tentative...

I've decided to take quite a bit of your advice. After manually disassembling more than 1k of instructions it just got too tedious. I was going to do all 8k by hand and then try to automate my next project much more. This was an intentional choice on my part, for the educational value. I was right — insofar as the headers and some other unexpected data structures were concerned — but I've hit diminishing returns at this point.

I'm going to start over using radare2. I'll massage the output into something 64tass friendly with a script. It will give me an excuse to work on my Raku (ex-Perl 6) at the same time as my 6502 Assembly. Nerd heaven.


PostPosted: Tue Feb 23, 2021 9:04 am
Joined: Tue Feb 23, 2021 8:37 am
Posts: 3
Even though this answer comes nearly a year later: you should use JC64dis for this task (https://iceteam.itch.io/jc64dis).
Version 1.0 will also support loading cartridges directly.

I have reverse engineered many music players over the years in order to get commented source code that assembles back to the original binary. Each player took months of manual work, even when I started from a raw disassembly.

With JC64dis you simply instruct the engine to produce the source in the form you want, based on what you have understood from analyzing it.

It is an iterative process, like the one I use when working by hand, but it produces the result instantly. (I tested many disassemblers before coding this, but none of them reduced the manual work needed when you study the generated disassembly.)


PostPosted: Tue Feb 23, 2021 11:28 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Welcome! Looks like a nice project - and open source.


PostPosted: Tue Feb 23, 2021 2:19 pm
Joined: Fri Nov 16, 2018 8:55 pm
Posts: 71
ice00 wrote:
Even though this answer comes nearly a year later: you should use JC64dis for this task (https://iceteam.itch.io/jc64dis).
Version 1.0 will also support loading cartridges directly.

I have reverse engineered many music players over the years in order to get commented source code that assembles back to the original binary. Each player took months of manual work, even when I started from a raw disassembly.

With JC64dis you simply instruct the engine to produce the source in the form you want, based on what you have understood from analyzing it.

It is an iterative process, like the one I use when working by hand, but it produces the result instantly. (I tested many disassemblers before coding this, but none of them reduced the manual work needed when you study the generated disassembly.)


Wow! I'm absolutely using this. My disassembly project has been dormant (but absolutely not abandoned) due to real life concerns. Because I'm new to this, I have a difficult time following the reads/writes to custom chips. At a minimum, your disassembler will help me follow some of that a bit better. Thank you!

I know you said you reverse engineered some music players and that was your motivation. Very cool. But can you explain a bit about the development process and some of the struggles you ran into? A reply might be worth its own thread.


PostPosted: Tue Feb 23, 2021 3:36 pm
Joined: Tue Feb 23, 2021 8:37 am
Posts: 3
Usually the I/O-mapped chips help a lot in understanding the code.

For example, with SID music, reads and writes in the $D4xx range are the first thing I look for.

This is also why the program automatically comments known memory locations with their meaning:

$D418 ; SID volume and filter mode

I look for instructions that write to the $D400/$D401 area (the frequency of the notes), and then I look for where the values written there are read from (selected by some index value, which is the note to play).
That area is the frequency table (JC64dis now tries to find it by itself).

Then I move on to the other locations (pulse width, waveform control, and so on), trying to figure out what each block of code is doing.
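
As a rough illustration of that first step, the Python sketch below scans a binary for absolute stores whose operand falls in the writable SID register range ($D400-$D418). It is not how JC64dis works internally. It is a naive byte scan that does not track instruction boundaries, so hits inside data or inside other operands are possible, and `find_sid_writes` is a name invented for this illustration.

```python
SID_FIRST, SID_LAST_WRITABLE = 0xD400, 0xD418
STORE_OPCODES = {
    0x8D: "sta abs",
    0x8E: "stx abs",
    0x8C: "sty abs",
    0x9D: "sta abs,x",
    0x99: "sta abs,y",
}


def find_sid_writes(code: bytes, origin: int) -> list[tuple[int, str, int]]:
    """Return (address, store form, operand) for each candidate SID store."""
    hits = []
    for i in range(len(code) - 2):
        form = STORE_OPCODES.get(code[i])
        if form is None:
            continue
        operand = code[i + 1] | (code[i + 2] << 8)  # little-endian address
        if SID_FIRST <= operand <= SID_LAST_WRITABLE:
            hits.append((origin + i, form, operand))
    return hits


if __name__ == "__main__":
    # Made-up fragment: lda #$0f; sta $d418; rts, assembled at $1000.
    fragment = bytes([0xA9, 0x0F, 0x8D, 0x18, 0xD4, 0x60])
    for address, form, operand in find_sid_writes(fragment, 0x1000):
        print(f"${address:04X}: {form} -> ${operand:04X}")
```

Each hit is a place to start reading backwards to see where the stored value came from, which for $D400/$D401 usually leads to the frequency table.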

I also separate code from data (when you find a lot of illegal instructions, that area is certainly data).

You can find many complete disassemblies I have done in the past in the online PDF magazine SIDin, if you want to see the final result.


PostPosted: Tue Feb 23, 2021 4:19 pm
Joined: Fri Nov 16, 2018 8:55 pm
Posts: 71
ice00 wrote:
Usually the I/O-mapped chips help a lot in understanding the code.


So I'm learning... the hard way.

ice00 wrote:
You can find many complete disassemblies I have done in the past in the online PDF magazine SIDin, if you want to see the final result.


Cool! Do you have a link to some of these? I'd like to read them over.

