fschuhi wrote:
Thank you for putting in the time, that was a great intro into how to work with your tool. Very much appreciated.
I need to work more on, and document more about, WFDis. Any use of it automatically gives me a little kick to do so for its own purpose as well.
Quote:
The adhoc emulation with Shift-R is awesome. I didn't know that WFDis can do that. With regard to the task at hand, it was helpful to see how a subroutine address on the stack can be used to access data immediately after the JSR. That's an important idiom to know.
Yes. When it comes to either pre-run snapshots (static analysis) or post-run snapshots (emulation), it's important to understand there is a flow of behavior beyond just the byte dump you happen to be seeing at one time. There might have been intermediate states in the memory footprint before/after, and that requires knowing what the code is actually doing, and why. Just looking at 1 state of memory might not reveal everything.
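To make the return-address idiom concrete, here is a minimal Python sketch (not WFDis code) of what the called routine effectively does. It relies on the documented 6502 behavior that JSR pushes the address of its own last byte, high byte first, so the inline data starts one byte past the pulled address. The memory layout, the zero terminator and the function names are all made up for illustration.

```python
def jsr_push(pc, stack):
    """Push what a JSR located at `pc` pushes: the address of its last byte."""
    ret = (pc + 2) & 0xFFFF
    stack.append(ret >> 8)    # high byte goes on the stack first
    stack.append(ret & 0xFF)  # then the low byte


def read_inline_string(memory, stack):
    """Pull the return address and read the zero-terminated data after the JSR.

    Returns (data, resume_pc), where resume_pc is the first byte after the
    terminator, i.e. where execution should continue.
    """
    lo = stack.pop()
    hi = stack.pop()
    addr = (((hi << 8) | lo) + 1) & 0xFFFF  # data begins right after the JSR
    data = bytearray()
    while memory[addr] != 0:
        data.append(memory[addr])
        addr = (addr + 1) & 0xFFFF
    return bytes(data), (addr + 1) & 0xFFFF


# Hypothetical layout: JSR $C100 at $C000, followed by "HI" and a 0 terminator.
memory = bytearray(0x10000)
memory[0xC000:0xC006] = bytes([0x20, 0x00, 0xC1]) + b"HI\x00"

stack = []
jsr_push(0xC000, stack)
data, resume_pc = read_inline_string(memory, stack)
print(data, hex(resume_pc))
```

In this layout the bytes at $C003-$C005 are data, so a static trace that naively continues after the JSR would disassemble them as code. A real routine would push resume_pc - 1 back onto the stack before its RTS, since RTS adds one to the pulled address.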
Quote:
Prepending labels with L and S is sensible, so I have added this right away to my workbench. I would have expected that jumps were labeled with 'J', but you stick to 'L' here as well. Any particular reason why?
No particular reason. "L" was the default, then I added "S" (I used "Sub" originally, but it ended up being annoyingly long on the earlier pre-HTML displays). I agree that separate prefixes for subroutines, branch/jump targets, and data accesses would be reasonable.
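As a rough sketch of such a prefix scheme, assuming labels are just a prefix plus the lowercase 4-digit hex address (the exact label format WFDis emits is not stated here, and the kind names are hypothetical):

```python
# Defaults mirror the current convention: "S" for subroutines, "L" for the rest.
DEFAULT_PREFIXES = {"sub": "S", "jump": "L", "branch": "L", "data": "L"}


def make_label(address, kind, prefixes=DEFAULT_PREFIXES):
    """Build a label from a per-kind prefix and the 16-bit address."""
    return f"{prefixes[kind]}{address & 0xFFFF:04x}"


print(make_label(0xC100, "sub"))

# A hypothetical variant with distinct prefixes for jumps and data.
split_style = dict(DEFAULT_PREFIXES, jump="J", data="D")
print(make_label(0xC002, "jump", split_style))
```

Keeping the prefixes in a table means switching from "L" to "J" for jump targets is a one-line change.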
Quote:
I had originally lowercased mnemonics and addresses (like seen for x86). Most of the snippets on this forum are uppercase, though. In WFDis you decided to lowercase everything. What is your reasoning behind deviating from the common "style guide"?
Old code written way back in the 1900s was often on platforms with only uppercase available, or only uppercase by default, so there was little choice, and it created an ad-hoc standard. I believe most actual modern-written code bases use lowercase. However, when posting code snippets and stating things inline in text (like here talking about LDA label,Y), it visually helps to distinguish the code from normal labels & prose by using uppercase, as ye olde code did. Automated output sometimes keeps uppercase to distinguish it from human-written code.
There were also studies back in the day on single-case text which found that lowercase is easier to read than uppercase, but lowercase was deemed too disrespectful for certain names and titles which should be Proper Cased, so they went with upper-only. I tend to find lowercase more readable as well, when looking at full listings.
Quote:
I lack the skill of keeping the structure of even smaller stretches of code in mind.
In order to get any level of this type of work done, you need to be able to just look at a few instructions at a time to see what they do, independent of the code that surrounds it. Each piece performs some deterministic mechanical step, and doesn't necessarily require the full structure to understand. The small basic blocks that I split up in the video are only like 2-4 instructions long, and are pretty representative of what you need to determine the lowest granularity of documentable functionality. At that point, you should start looking at those small blocks as whole units that interact with each other, instead of always at the instruction level.
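For comparison, here is a sketch of the formal basic-block split, which cuts a listing after every branch, jump or return and before every branch target. This is the textbook definition, not necessarily how the chunks in the video were picked by hand. The instruction tuples (address, mnemonic, target) are a hypothetical format for illustration.

```python
# Mnemonics after which straight-line execution is not guaranteed to continue.
ENDS_BLOCK = {"jmp", "rts", "rti", "brk",
              "bcc", "bcs", "beq", "bne", "bmi", "bpl", "bvc", "bvs"}


def split_basic_blocks(instructions):
    """Split (address, mnemonic, target) tuples into basic blocks."""
    targets = {t for _, m, t in instructions if m in ENDS_BLOCK and t is not None}
    blocks, current = [], []
    for addr, mnemonic, target in instructions:
        if addr in targets and current:
            blocks.append(current)      # a branch lands here: start a new block
            current = []
        current.append((addr, mnemonic, target))
        if mnemonic in ENDS_BLOCK:
            blocks.append(current)      # control may leave here: close the block
            current = []
    if current:
        blocks.append(current)
    return blocks


# Hypothetical copy loop; comments show the instruction each tuple stands for.
listing = [
    (0xC000, "ldx", None),    # ldx #$00
    (0xC002, "lda", None),    # lda $c100,x   (loop target)
    (0xC005, "sta", None),    # sta $0400,x
    (0xC008, "inx", None),    # inx
    (0xC009, "bne", 0xC002),  # bne $c002
    (0xC00B, "rts", None),    # rts
]
blocks = split_basic_blocks(listing)
for block in blocks:
    print([f"{addr:04x} {m}" for addr, m, _ in block])
```

The middle block (lda/sta/inx/bne) is the kind of small unit that can be documented on its own, here as "copy one byte and advance".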
Quote:
For this reason I have put most of the work so far into automated structural analysis of the code. I do this in Python, because the emulator is written in Python, too. BTW Excel doesn't play a big role in the project, for me it's just a convenient notebook with additional intelligence and easy interface to the emulator and tracer.
So far I find structural analysis to be a bit of a red herring with 8-bit code. There is no ABI or anything that code needs to follow, except maybe calls to ROM routines. The control flow, especially in games, is convoluted for speed, not clarity, and most code takes shortcuts to make it easier to write as well. Timer or video-based interrupts can do weird mutations to code and data, because they need to be fast, and obviously don't fit in clean control flow graphs. Since there are so few registers, lots of transient state is passed around in memory, reusing memory locations for multiple purposes at multiple times.
So it doesn't make a lot of sense to me to try to map all of this hand-wrangled bit banging into some clean single model of execution. But that also depends on what you mean by "structural analysis".
Quote:
I think using a manual disassembler is compatible with my reverse engineering approach. I lean on the ideas in Don Lancaster's "Tearing Into Machine-Language Code". He advises against using tools to disentangle the code and rather advocates "Do the dull stuff yourself!". But his method also assumes that those who want to reverse engineer should be versed in 6502. A bit more automated structure discovery is necessary to help my learning.
As I mentioned above, I don't believe that the code/data separation is a significant portion of the work. Having static tracing or emulation traces really assists in those steps, though not without their faults (both can miss code areas, for instance), but it's relatively easy to deduce which portions of the data are code:
- "Holes" of uncalled data surrounded by large code areas are often also code.
- A9 is "LDA #xx" which very commonly starts code paths.
- In my video example where one of the pointers went to an area with something like "00 00 10 20 30 ff ff ff" I assumed that it was not code, because it appears more structured like bitwise data.
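The first two hints can be sketched mechanically. This is only a rough illustration: the range format (start, end-exclusive) and the function names are assumptions, and the third hint (data that "looks structured") is a judgment call that the sketch does not try to model.

```python
def find_holes(traced):
    """Return the untraced gaps lying between traced code ranges.

    `traced` is a list of (start, end_exclusive) address ranges.
    """
    ranges = sorted(traced)
    holes = []
    cursor = ranges[0][1] if ranges else 0
    for lo, hi in ranges[1:]:
        if lo > cursor:
            holes.append((cursor, lo))
        cursor = max(cursor, hi)
    return holes


def starts_like_code(memory, hole):
    """True if the hole begins with A9 (lda #imm), a common start of a code path."""
    lo, _ = hole
    return memory[lo] == 0xA9


# Hypothetical trace result: two code areas with an 8-byte gap between them.
memory = bytearray(0x10000)
memory[0xC020] = 0xA9
traced = [(0xC000, 0xC020), (0xC028, 0xC100)]

for hole in find_holes(traced):
    lo, hi = hole
    verdict = "maybe code" if starts_like_code(memory, hole) else "unknown"
    print(f"{lo:04x}-{hi - 1:04x}: {verdict}")
```

A hit only marks a candidate worth disassembling by hand. A9 is also a perfectly ordinary data byte.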
Static tracing from some starting point (or emulation traces) saves a ton of tedium, but the majority of time is still spent trying to glean understanding from the code, not on the actual code/data separation.
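A static trace of this kind can be sketched as a worklist that follows branches, jumps and calls from an entry point and marks every byte it decodes. This is a toy, not how WFDis does it: the opcode table covers only a dozen real 6502 opcodes, indirect jumps and stack tricks are ignored, and the test program is made up.

```python
# opcode -> (mnemonic, instruction length). Deliberately tiny subset.
OPCODES = {
    0xA9: ("lda", 2), 0xA2: ("ldx", 2), 0xBD: ("lda", 3),
    0x8D: ("sta", 3), 0x9D: ("sta", 3), 0xE8: ("inx", 1),
    0xCA: ("dex", 1), 0xD0: ("bne", 2), 0xF0: ("beq", 2),
    0x20: ("jsr", 3), 0x4C: ("jmp", 3), 0x60: ("rts", 1),
}
BRANCHES = {"bne", "beq"}


def trace(memory, entry):
    """Return the set of addresses reached as code when tracing from `entry`."""
    code = set()
    work = [entry]
    while work:
        pc = work.pop()
        while pc not in code and memory[pc] in OPCODES:
            name, size = OPCODES[memory[pc]]
            code.update((pc + i) & 0xFFFF for i in range(size))
            if name in BRANCHES:
                offset = memory[(pc + 1) & 0xFFFF]
                if offset >= 0x80:
                    offset -= 0x100          # relative offsets are signed
                work.append((pc + 2 + offset) & 0xFFFF)
            elif name in ("jsr", "jmp"):
                lo = memory[(pc + 1) & 0xFFFF]
                hi = memory[(pc + 2) & 0xFFFF]
                work.append((hi << 8) | lo)
            if name in ("jmp", "rts"):
                break                        # no fall-through after these
            pc = (pc + size) & 0xFFFF
    return code


program = bytes([
    0xA2, 0x00,                                # c000 ldx #$00
    0x20, 0x10, 0xC0,                          # c002 jsr $c010
    0xE8,                                      # c005 inx
    0xD0, 0xFA,                                # c006 bne $c002
    0x60,                                      # c008 rts
    0x00, 0x00, 0x10, 0x20, 0x30, 0xFF, 0xFF,  # c009 data
    0xA9, 0x01,                                # c010 lda #$01
    0x8D, 0x20, 0xD0,                          # c012 sta $d020
    0x60,                                      # c015 rts
])
memory = bytearray(0x10000)
memory[0xC000:0xC000 + len(program)] = program

code = trace(memory, 0xC000)
print(f"{len(code)} bytes marked as code")
```

The trace always falls through after a JSR, which is exactly where the inline-data idiom would lead it astray, and anything reached only through a vector or a jump table is never visited. Those are the misses that still need a human.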
What I find the most useful by far is giving names to things, especially variables and subroutines. Even when they're just guesstimates, once you name something and look at its various uses you can piece together a picture of what it's for, in a somewhat bottom-up fashion, but you need to focus on things that will reveal the most. There are a few good starting points that can anchor some of this understanding:
- Accesses to well-known I/O addresses (keyboard/joystick inputs, video registers)
- Accesses to screen memory
- Functions that are called from many places (usually indicates main loop or small utility functions)
- Writes to known system or software vectors (which will generally point to code)
- Calls into ROM
- Initialization code can reveal the overall memory layout
These can help to name some of the primitives that the program is constructed from, which can start to make the overall structure more visible/readable.
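A few of these anchors are easy to pull out of a disassembly automatically. The sketch below assumes a hypothetical listing of (address, mnemonic, operand address) tuples and a hand-made table of known addresses. The C64 addresses are used purely as an example, since no platform is named here.

```python
from collections import Counter


def find_anchors(instructions, known_addresses, screen_range):
    """Collect call counts, known-I/O accesses and screen-memory accesses."""
    calls = Counter(op for _, m, op in instructions if m == "jsr")
    io_hits = [(addr, known_addresses[op])
               for addr, _, op in instructions if op in known_addresses]
    screen_hits = [addr for addr, _, op in instructions
                   if op is not None and op in screen_range]
    return calls, io_hits, screen_hits


# Example tables for a C64 (an assumption, swap in the target machine's map).
KNOWN = {0xDC00: "CIA1 port A (joystick/keyboard)", 0xD020: "border color"}
SCREEN = range(0x0400, 0x07E8)

listing = [
    (0xC000, "jsr", 0xC100),
    (0xC003, "lda", 0xDC00),
    (0xC006, "sta", 0x0400),
    (0xC009, "jsr", 0xC100),
    (0xC00C, "sta", 0xD020),
    (0xC00F, "rts", None),
]

calls, io_hits, screen_hits = find_anchors(listing, KNOWN, SCREEN)
for target, count in calls.most_common():
    print(f"{target:04x} called {count}x")
for addr, name in io_hits:
    print(f"{addr:04x} touches {name}")
```

The most-called targets and the routines touching input or video registers are usually the first things worth naming.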
Quote:
I'm going to continue working on the analysis over the weekend, let's see if I am able to share interim results. I would certainly consider transferring at least some of the automatically generated info to the listing in WFDis (e.g. caller/callee), so it would be nice if your next version re-enables inline comments. But that's just a nice to have, the tool is certainly powerful as it is. Thanks again for the help!
It can import label files, but not currently comments. The internal rewrite I'm still working on shouldn't change the current appearance or functionality much, but is necessary for future expansion and things like standalone and multi-line comments (right now all comments are tied to a single line of code as well). Cross-reference data is easily stored, but still remains on the TODO list for exposing graphically.