6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 22, 2024 10:27 pm

All times are UTC




Post new topic Reply to topic  [ 29 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Mon Jul 23, 2012 8:12 pm 
Offline

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
Hi there,

I have got a mail notification from the poster of that interesting project: a 6502 decompiler:
http://www.indiegogo.com/6502
to "decompile" 6502 machine code into C - even code that had been written purely in assembler.

with a demo here
http://6502.emulinks.de/
which explains some of the pitfalls the author wants to fix by working on the pledge.

I think the project is interesting and ambitious - I'm looking forward to the results!

_________________
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/


Top
 Profile  
Reply with quote  
PostPosted: Mon Jul 23, 2012 10:47 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8544
Location: Southern California
It looks like a very ambitious project. I'm partly thinking about all the spaghetti code that would have to not only be transferred to a higher-level language, but the decompiler would have to figure out entirely different approches to arrive at structures, if it is going to go so much further than just a disassembler. I will be anxious to see what secrets might be discovered. With such a huge software base, there's bound to be some outstanding method someone kept to themselves, either on purpose, or simply never got around to publishing it.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 1:31 am 
Offline

Joined: Tue Jun 26, 2012 6:18 pm
Posts: 26
This won't work, I'm afraid. I've reverse engineered _a lot_ of programs and based on that experience I can only say that such a task cannot be done for several reasons.
First of all, if you take a look at the examples on that website, you soon see the great disadvantage of this approach: it does not add any semantics. It just exchanges one obscurity for another. Looking at the C code you still don't know what it means, even worse, what might have been readable for an experienced 6502 programmer turns into some cryptic gibberish. You do not gain meaning, you lose it. So why do it?
The very idea of a decompiler is based on a (old) misconception how language works, how syntax and semantics relate to each other. While this, sorry to say so, is typical of a computer scientiest, a linguist will only shake his head over this rather naive idea. While there is Turing to help us emulating a program code there is also Goedel who tells us there are unavoidable pitfalls in the way referencing is done. Try to translate the following sentence into French: 'This sentence will have a different meaning when translated into French.' And now translate the following expression into C:
Code:
      lda   #$60
      sta   label
label:      jmp   label
Want some real examples? Here we go...
A lot of programs that deal with 3d use look-up tables for fast multiplication. If the program uses logarithm look-up tables, then how will the decompiler know what the following subroutine actually does?
Code:
      clc
      lda   $4000, x
      adc   $4000, y
      lda   $4100, x
      adc   $4100, y
      bcc   label
      tax
      lda   $4200, x
      rts
label:      lda   #0
      rts
A good conversion to C should result in
A = (X * Y) >> 8;
Bet, the decompiler won't do that.
Then, of course, there is the problem of self-modifying code. Prince of Persia, for example, will patch its drawing routines with instructions AND, ORA, EOR, or STA to generate different effects.
Code:
      lda   midOP, x; X is a shape index
      sta   OPACITY
      ...
; later in the subroutine
      ldx   OPACITY
      lda   OPCODE, x
      sta    :smod; patch instruction!
      ...
      lda   (IMAGE), y
:smod:      ora   (BASE), y
      sta   (BASE), y

OPCODE:   dfb   $31            ;   and (oper),Y
      dfb   $11            ;   ora
      dfb   $91            ;   sta
      dfb   $51            ;   eor
      dfb   $31            ;   and
      dfb   $91            ;   sta

Now how will the decompiler handle this?
Furthermore, not all games are written in pure 6502 assembly language. There are adventure games that use their own action tables, and there are a lot of games that use some form of bytecode for their logic. Just to name a few:
- Zork I
- Maniac Mansion
- The Guild of Thieves
- Wizardry
- Dragon Wars
Infocom adventures use their own virtual machine known as the Z-machine.
Lucasfilm Games adventures come with SCUMM.
Magnetic Scrolls adventures are actually written in some kind of 68000 subset, so when the C64 version runs 'The Pawn', 'The Guild of Thieves', 'Jinxter' etc it runs an emulator that basically emulates 68000 machine language (I'm not kidding).
Wizardry was written in UCSD-Pascal and runs on a virtual stack machine, and Dragon Wars uses a virtual 8/16 bit processor.
Okay, you could say: that doesn't count, these are adventures, not 'real' games. How about the game 'Mercenary' (C64) then? It uses its own bytecode which handles the 'Benson' messages and will also enable you to buy and sell things on different locations. This bytecode has direct access to the 6502 memory and it uses this access to set flags, e.g. to flip the X values (used in 'The Second City'). And now? Even a tiny arcade game like 'Buck Rogers - Planet of Zoom' uses bytecode to move the aliens around the screen. And this bytecode can do GOSUBs and GOTOs as well. And this means that the address of the jump target is vital for the bytecode to function properly. To decompile this, you would have to understand the meaning of this bytecode. You need to know that certain bytes of the bytecode (which just look like data to the 6502 decompiler) are in fact pointers to different code sections.
If you really want to handle this automatically without knowing the meaning of the bytecode, you somehow have to emulate the 64kb memory of the 6502. But when you start emulating the memory, then why not emulate the whole machine? One may not forget: games were written for a specific machine in terms of video and sound. C64 games rely heavily on sprite collision detection or sprite multiplexing. Prince of Persia (AppleII) needs the high bit set in its shapes (possible on the AppleII resulting in four colours black, white, orange, and blue) so that it can shift and mirror the shapes easily. How should one decompile this?
Conclusion: sorry, but I only can say this: 'Decompiler? Impossible!'
Cheers
Miles


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 2:39 am 
Offline

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
I wholeheartedly concur with Miles. Having done a ton of 6502 reverse engineering as well, I noted the repetitiveness of what I was doing in order to try to understand the code, and sought automation. It is absolutely an AI problem, not one of merely "smart" conversion.

I actually don't consider it impossible for a large majority of cases (including selfmod code examples listed above), and would like to eventually get a proper AI framework set up in order to tackle conceptualization of assembly language in context of a full machine (I build commercial AI frameworks for a living), but this is nothing that $48k will hope to touch. It's a massive undertaking, and nothing on his page indicates that he has any breakthrough that would allow him to get through it, nor does he even acknowledge the difficulty of automated reverse engineering. The output of a straight asm-to-C conversion will be as unreadable (or moreso) than the original asm.

edit: I had a look at one of the example disassemblies, and yeah that's pretty much the case. There's simply no way to directly convert the stack return address trickery used, nor clever usages of the carry flag & bit-shifting instructions in C, as noted in his comments. He ties their solution to specific buzzwords, but that's only the tip of the iceberg regarding the challenges he faces.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 3:57 am 
Offline

Joined: Sat Dec 13, 2003 3:37 pm
Posts: 1004
Eh. It's fair to criticize a decompiler for it's potential lack of understanding of 6502 nuances, but blaming a 6502 decompiler for not being able to decompile byte code is a bit much. Of course it can't.

What you'd end up with is a decompilation of the run time, and blocks of data representing the byte code. The decompiler would be mostly savvy enough to capture the distinction between "code" and "data", even if the data itself is also "code" (just not 6502 code).

A basic decompiler would have a mess of a time decompiling a Forth interpreter. Especially automatically. You'd be lucky if it got past decompiling the most basic primitives, and yet it would still miss a majority of the code words since at start up, Forth is pretty much entirely data driven.

But reverse engineering never works alone, there's typically a person guiding it and using the tools as appropriate.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 5:05 am 
Offline
User avatar

Joined: Thu May 28, 2009 9:46 pm
Posts: 8507
Location: Midwestern USA
Miles J. wrote:
This won't work, I'm afraid....'Decompiler? Impossible!'

I seldom use the word "impossible" when it comes to computers, but I'd have to agree with Miles that this would be a very tough nut to crack. I can think of many instances where a "decompiler" would fail. A simple example would the use of the PRIMM subroutine in the Commodore C-128 kernel (also posted in the code archives in several forms):

Code:
         jsr primm
         .byt 'Text printed by PRIMM.',$00

How would the decompiler know that the bytes following the PRIMM call are text, not code? It wouldn't unless it was able to recognize a call to the C-128 kernel when it "sees" one. If that were the case, then it would no longer be a general purpose tool.

The fact is, many 6502 programs were written with data tables scattered everywhere, and few have them neatly ensconced at the end of the program (the Commodore kernel is full of such code). Again, how would the decompiler know this unless it already had detailed knowledge of the machine code and the system on which it is/was run? Or, what about the use of one of the BIT opcodes to produce "blind" jumps over pieces of code? What would that look like to a decompiler? In order for it to get that sort of information, someone (human) would have had to have written it into the decompiler. If someone is going to expend that kind of effort and has that level of programming talent, they might as well produce a commented disassembly of the program of interest and publish the results so others can, if they're so inclined, write the equivalent in C (which should be less difficult than what it took to develop the original).

While I would never discourage anyone from working on whatever it is that floats their boat, I stop at the point where funding is requested for something that will most likely prove to be an exercise in futility.

BTW, the author at that site said this:

    Machine code is really hard to grasp because it is entirely void of any sort of structure. It is simply not designed to be understandable.

Machine code is not "designed" to be anything, other than executable. I had to chuckle when I read that, since I, as well as numerous others, obviously do understand this stuff.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 8:36 am 
Offline

Joined: Tue Jun 26, 2012 6:18 pm
Posts: 26
White Flame wrote:
Having done a ton of 6502 reverse engineering as well
Wow! Brilliant! Would you mind if I'd asked you which programs?
whartung wrote:
Eh. It's fair to criticize a decompiler for it's potential lack of understanding of 6502 nuances, but blaming a 6502 decompiler for not being able to decompile byte code is a bit much. Of course it can't.
But that's the point. A language provides you with the opportunity to put a completely different language construct on top of it, something that is not included in the original (assembly) language and cannot be mathematically identified and handled on the lower machine language level, especially not when it comes to self-referencing. Without an analysis of the bytecode your task is simply not done. Bytecode is essential for the program to work, and all interesting programs I know of so far use some form of bytecode in one way or another. (E.g. FSII, The Jet, Thunder Chopper all based on the sublogic engine use drawing lists to calculate their 3d graphics primitives). And if it's not bytecode than there are still a lot of pointers inside the data which also need translating.
whartung wrote:
What you'd end up with is a decompilation of the run time, and blocks of data representing the byte code. The decompiler would be mostly savvy enough to capture the distinction between "code" and "data", even if the data itself is also "code" (just not 6502 code).
I don't need a decompiler for that. A while ago I wrote a little automated disassembler which helps me in reverse engineering programs. I can put together a list of commands like
Code:
@label 'vertex_rotate' := $87e4
@collect $4000
@checklabel
@set   $800
@write   $bfff
which after less than a second gives me a formatted disassembly listing of the whole memory, which also separates code from data. In addition, I can instruct the disassembler to handle those PRIMM subroutines BigDumbDinosaur mentions (a method used in other programs like Ultima IV as well) and other things. No need for a special 'decompiler.'
whartung wrote:
A basic decompiler would have a mess of a time decompiling a Forth interpreter.
Same for the p-code interpreter of the UCSD-system. It uses the 6502 stack as an evaluation stack. Decompiling...?
whartung wrote:
But reverse engineering never works alone, there's typically a person guiding it and using the tools as appropriate.
And this person usually works best with a disassembler.
BTW unfortunately I'm not a millionaire. Otherwise I would like to spend $3000 on this:
Ulrich Hecht wrote:
$3,000 Custom reversing
You'll get updates, a note, a 6502, git access, and you can name one historical 6502 program that I will personally reverse-engineer completely: I will decompile it, give functions and variables sensible names, and add comments, until the result is as good as hand-written C code.
'$3000?' Now that sounds rather cheap to me. It's like saying: 'Oh, I gonna translate 'War and Peace' for you for just $3000...' I don't think Dr. Cat received only $3000 for doing his C64 conversion of the original AppleII source code of Ultima V. At least I know, I wouldn't do it. Translating a program like 'Elite' or 'Mercenary' to C (as good as hand-written C code!) can take you quite a long time. Sorry for sounding so harsh (I usually don't do this), but I don't get the feeling that Mr Hecht really knows what he's talking about... :(
Cheers
Miles


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 4:38 pm 
Offline

Joined: Tue Jun 26, 2012 6:18 pm
Posts: 26
@White Flame
Sorry, just noticed you joined today. Welcome to the forum!


Only for fun I fed the prototype decompiler with the following code
Code:
  $0: a0 0b     ldy   #$b
  $2: a2 00     ldx   #0
  $4: 84 09     sty   $9
  $6: 86 0a     stx   $a
  $8: 4c 0b 00  jmp   $b
   
  $b: a0 11     ldy   #$11
  $d: e0 01     cpx   #1
  $f: 90 f3     bcc   $4

 $11: a9 12     lda   #$12
 $13: e8        inx
 $14: d0 ee     bne   $4
 $16: 4c 00 00  jmp   $0
   
 $19: 60        rts
 ; insert some other code here
 $1a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $2a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $3a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $4a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $5a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $6a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $7a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $8a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $9a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $aa: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $ba: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $ca: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $da: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $ea: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 $fa: 00 00 00 00 00 00
$100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
$110: 00

$111: 69 34     adc   #$34
$113: a0 19     ldy   #$19
$115: ca        dex
$116: 4c 04 00  jmp   $4
The result was:
Code:
extern uint8_t mem_000a;
extern uint8_t mem_0009;
void start_0000()
{
    uint8_t y;
    uint8_t x;
label_0000:
    y = 11;
    x = 0;
    do {
        do {
            mem_0009 = y;
            mem_000a = x;
            y = 17;
        } while (x < 1);
        x = x + 1;
    } while (x);
    goto label_0000;
}
However, what the code does is:
Code:
      clc
      lda   #$12
      adc   #$34
      rts
What does it tell us? The prototype decompiler does not detect self-modifying code yet(!), and, to be frank, I don't see how even the most sophisticated decompiler could handle this piece of mean, ugly, nasty code properly. As you can see there is no added data involved, just pure 6502 code. And yet, under the surface (or rather on top of it) there is another layer of threaded code with the address values of the routines used as opcode values. Since assembly language is an interpreted language, code that at first glance looks static (with the exception of the patched JMP instruction) can easily produce a continuous state change. The generated static C code in contrast depends on one single state of the memory. If it wants to imitate the state change of the 6502 program, it has to convert the code flow somehow into some data flow which (in case it will ever be successful) will result in ... exactly, you've guessed it: an emulator. So, what was the point exactly?
Oh well.. Enough of this silly rant... Where is the weekend when you need it...?

Cheers
Miles


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 5:46 pm 
Offline
User avatar

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
If the decompiler performs a static analysis of the code it will never correctly figure out what is going on in self-modifying programs. You'd have to emulate the CPU and its hardware to build up a more dynamic picture of how the code executes and changes over time.

However, judging from the video the main target for this decompiler is ROM based code in game cartridges so its not likely to come across anything too horrible. It should be able to reverse engineer a high percentage of most games.

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 8:20 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
This is indeed a very open-ended project. I can't even think how one might specify exactly what the perfect decompiler would do. The aim, clearly, is to move incrementally to ever-better decompilers, presumably picking off the tractable problems found in common code. Much like static analysis for defects, it's not something which is ever finished. But with a budgeted 9 months of full-time effort, could be some significant result.

Note that Ulrich has already helpfully provided 3 links for people arguing that it can't be done: he's not unaware of these points.

The big problem seems to me to be the budget: he needs quite a number of supporters willing to donate.

(The demo is interesting. Note that Ulrich is responsible for tcc-65816)

Cheers
Ed


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 10:05 pm 
Offline

Joined: Sat Dec 13, 2003 3:37 pm
Posts: 1004
BigDumbDinosaur wrote:
[size=115]
Miles J. wrote:
This won't work, I'm afraid....'Decompiler? Impossible!'

I seldom use the word "impossible" when it comes to computers, but I'd have to agree with Miles that this would be a very tough nut to crack. I can think of many instances where a "decompiler" would fail. A simple example would the use of the PRIMM subroutine in the Commodore C-128 kernel (also posted in the code archives in several forms):

Code:
         jsr primm
         .byt 'Text printed by PRIMM.',$00

How would the decompiler know that the bytes following the PRIMM call are text, not code? It wouldn't unless it was able to recognize a call to the C-128 kernel when it "sees" one. If that were the case, then it would no longer be a general purpose tool.


Simple, you tell the decompiler that the binary is for a C-128 so that it can apply known heuristics to the process (such as "what PRIMM does").

I assume that PRIMM take the return address from the stack, prints out the text following it up to the $00 byte and then JMPs back to the end.

That wouldn't necessarily be a difficult platform specific rule to add to a decompiler.

Just like if the compiler sees a block of text that's not JMPd, JSRd, or falls in to (via the PC just running in to it), but also all happens to be ASCII, ending with either a hi-bit ascii or a null, that perhaps that's a 'Some String' vs a blob of hex codes. It's fair guess lacking any other information.

Over time, more generic rules as well as platform specific rules "Hey I saw this address passed to an IO routine, it's probably an IO block structure".

Lots of stuff can be done statically. If a simulator is added, then even more flow analysis can be done. And it doesn't have to be done on the entire program all at once. The user can point it at a piece of code and say "try this out and see what you can learn", and away it goes.

Do not underestimate what a tool like this can be capable of given time. One of their nice features is that once they learn something, they never forget. They just need to be guided by an experienced hand.

Finally, not all of the code is written obscurely. It's meant to be understood by humans as well, and folks tend to write clearer code over obscure code when they possibly can. Unlike the machines, we forget stuff all the time. Things like "now, how did this work again?".


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 24, 2012 10:25 pm 
Offline

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
Miles J. wrote:
White Flame wrote:
Having done a ton of 6502 reverse engineering as well
Wow! Brilliant! Would you mind if I'd asked you which programs?


Basically Commodore 64 games & scene stuff, helped out a lot with others trying to figure out copy protections, compression algorithms, etc.

I also started on WFDis back in 1996 or so (wow). It wasn't great but is at least interactive and did a lot of things that I'd not seen any other disasm do back then. I've since done a lot of analysis work on what it would take to have the machine perform reverse engineering, instead of just displaying information for a human to reverse engineer from. That's actually what drove me to get into AI.

BigEd wrote:
This is indeed a very open-ended project. I can't even think how one might specify exactly what the perfect decompiler would do. The aim, clearly, is to move incrementally to ever-better decompilers, presumably picking off the tractable problems found in common code. Much like static analysis for defects, it's not something which is ever finished. But with a budgeted 9 months of full-time effort, could be some significant result.


The problem is that the very architecture he's working on won't get him any farther than anyone else has gone, regardless of time spent. You cannot just add new features to a project like this and expect it to get progressively smarter towards an ideal; it needs some very fundamental and powerful conceptualization basis for everything to be written on, to allow all the weird combinations of indirection, reuse/overlays, timing dependencies, etc, to be generally considered. Plus, it must "know" some useful description of the system and actually how it works to be able to transform the code, instead of just blindly labeling things. This machine knowledge must combine with the knowledge of what the program is & does, and becomes a very fluid, dynamic mass of associations that are next to impossible to delineate and manually write code to manage each situation.

At some point, adding in more knowledge via explicit tests and exceptions turns even a slick, working proof of concept into a buggy, unwieldy mess.

However, there is value to a system that only goes somewhat smarter than most disassemblers. It just can't be expected to grow into a fully aware decompiler, especially for the idiosyncracies of the platforms involved.




edit:

whartung wrote:
Simple, you tell the decompiler that the binary is for a C-128 so that it can apply known heuristics to the process (such as "what PRIMM does").
...
Over time, more generic rules as well as platform specific rules "Hey I saw this address passed to an IO routine, it's probably an IO block structure".


This is an example of what I'm saying doesn't work. What if somebody wrote their own PRIMM, and added another minor feature so it's no longer recognizable to the system? What if somebody has slightly different IO features in their system? All's you're giving it is very limited binary matches, not more usable knowledge. These examples are not some esoteric spaghetti mess, it's just the difference between it knowing how you're manipulating the stack on an instruction-by-instruction basis vs a rote response to PRIMM specifically. Trying to inform it of enough of these explicit tricks and rules is impossible, even for it to understand very "normal" code.

Since it's also converting to another language, any misunderstanding it has will lose a ton of information about what the original program actually did, instead of just having a confusing 1:1 copy of the original pass through.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


Top
 Profile  
Reply with quote  
PostPosted: Wed Jul 25, 2012 1:08 am 
Offline

Joined: Tue Jun 26, 2012 6:18 pm
Posts: 26
whartung wrote:
Simple, you tell the decompiler that the binary is for a C-128 so that it can apply known heuristics to the process (such as "what PRIMM does").
In 'The Bard's Tale III' Rebecca Heineman compressed the text output following the JSR instruction. This code found at $7b3a (AppleII)
Code:
$7b3a:      jsr   $7439
      hex   f4,d2,68,98,22,89,83,29
      hex   89,51,90,52,68,9b,61,43
      hex   d4,e4,07,fa,52,1f,d4,af
      hex   b3,bf,ae,03,e2,bc,37,7d
      hex   03,65,49,21,9b,c3,31,2c
      hex   c1,9a,4c,df,f1,de,80
will first clear the output window and then write the string 'there are stairs here, going /up down . do you wish to take them?\r\r'.
whartung wrote:
It's fair guess lacking any other information.
Unfortunately there is no way that you can guess what this code does. The string compression was invented by Burger Becky, which means there are no 'known heuristics.' Instead you will have to set up a 'specific rule' because otherwise the decompiler does not know what is code and what is data. But this means you have to analyse the output routine before you can decompile it. Doesn't make sense, does it? :)
whartung wrote:
Finally, not all of the code is written obscurely.
Correct, but lots of code was not written in pure 6502 assembly language but some language the programmer created on the fly. If the task is to preserve old programs of that area we need to understand that these programs are more than just machine code binaries. The basic assumption of this whole undertaking is already seriously flawed.
White Flame wrote:
The problem is that the very architecture he's working on won't get him any farther than anyone else has gone, regardless of time spent.
Indeed, no matter how much time you spend on building a perpetual motion machine you will never come to an end because it violates fundamental laws of physics, and in a similar way that's what Mr Hecht and others attempt. The idea behind the decompiler project is: all we need is a vocabulary and a set of rules, and then we can translate a sentence from language A into a sentence from language B. This idea was proposed somewhat around the 1950s but was soon dropped when scientists realised that there is more to language than just a set of rules and a list of words. Since then a lot of people (linguists and AI scientists) have discussed this problem but so far noone has found a real solution for it. And while I'm probably one of the few linguists who do not for example agree with Searle (http://en.wikipedia.org/wiki/Chinese_room says: understanding is generally impossible) I doubt that Mr Hecht is able to come up with an idea where all the great minds failed.
BigEd wrote:
Note that Ulrich has already helpfully provided 3 links for people arguing that it can't be done: he's not unaware of these points.
The arguments presented do not discuss important things like self-modifying code or bytecode or self-referencing. Judging from the links I do not get the impression that the author has a concept of how language works (syntax, semantics, pragmatics etc), even if it is "just" a formal language.
White Flame wrote:
it needs some very fundamental and powerful conceptualization basis
Alas, I can see none.
White Flame wrote:
Plus, it must "know" some useful description of the system
Indeed, "knowing" is the key. How do you teach the decompiler about the programming context? If I tell you that my hovercraft is full of eels, you will either don't understand what I'm saying at all (though the syntax is correct) or you will "know" that I am referring to http://www.youtube.com/watch?v=akbflkF_1zY because you understand the reference with the aid of your world knowledge. What the decompiler attempts is to understand 'my hovercraft is full of eels' by looking at the letters 'm-y- -h-o-v-e-r-c-r-a-f-t-..'. It has no idea of any context yet, and Mr Hecht does not tell us how he will create the (huge) data base to provide his decompiler with enough information.

"White Flame wrote:
Basically Commodore 64 games & scene stuff, helped out a lot with others trying to figure out copy protections, compression algorithms, etc.
Good to see there are still people actively supporting the C64!
"White Flame wrote:
I also started on WFDis back in 1996 or so (wow). It wasn't great but is at least interactive and did a lot of things that I'd not seen any other disasm do back then. I've since done a lot of analysis work on what it would take to have the machine perform reverse engineering, instead of just displaying information for a human to reverse engineer from. That's actually what drove me to get into AI.
Sorry, I've never come across that program. Well, I bet you know more about decompiling then than Mr Hecht. :)

Cheers
Miles


Top
 Profile  
Reply with quote  
PostPosted: Wed Jul 25, 2012 4:20 am 
Offline
User avatar

Joined: Thu May 28, 2009 9:46 pm
Posts: 8507
Location: Midwestern USA
Miles J. wrote:
Good to see there are still people actively supporting the C64!

I have a C-128D here with a Lt. Kernal hard drive subsystem here, and it all works. And I'm still impressed with what Lloyd Sponenburgh and Roy Southwick accomplished in developing that subsystem. I'd love to see someone try to diassemble, let alone, decompile the Lt. Kernal DOS. :P

There is a lot of obfuscation in that system, some of which was necessary to prevent piracy (there were two piracy attempts in the late 1980s that I know of—neither of them successful). However, a lot of the obfuscation had to do with trying to cram a minicomputer-like operating system into an eight bit machine with only 8 KB in which to do it. The inevitable result was a lot of self-modifying code, overlays read in from specific disk locations, tons of hardware bit-twiddling to map shadow RAM (where the LK-DOS lived) in and out of processor space, etc. Adding insult to injury, the Lt. Kernal was a SCSI system and the SCSI bus had to be bit-banged—there weren't any SCSI ASIC's around back then to off-load the bus protocol. The core of the Lt. Kernal substantially emulated the Point 4 IRIS operating environment, which ran on one of the first minicomputers to use SASI to communicate with disks and tapes. The LK even supported keyed index files, themselves devilishly difficult to implement on a 8 bit MPU running at 1 MHz.

While I wouldn't say turning all that back into assembly language is impossible, the question I would have is, "Where do you start?" It isn't like a typical computer program, where there is a beginning, middle and end. Just the startup alone, which involved looking for a disk, figuring out the disk geometry (it was all ST-506/412 hardware in the early days), loading the DOS one piece at a time, and then making the entire mess not get in the way of the BASIC interpreter but still respond to typed commands, such as copying a file, running a program, etc., was terribly complex—far more so than the most complicated game ever written for 8 bit Commodore hardware.

Now, suppose someone did succeed in rendering the LK-DOS into C source code. What good would it do? Due to the small space in which the LK-DOS has to run, it's unlikely the most efficient C compiler on the planet could produce machine code that would be sufficiently succinct to work. Back in the day, I was very much involved with Lt. Kernel database development and had access to information that Fiscal Information was normally unwilling to disseminate. I also had many phone conversations with developers Sponenburgh and Southwick and heard all the accounts of what they went through to get it to work on the C-64, let alone on the C-128. They often spent days on one routine trying to contract it to where it would fit in the available space and run as fast as that 6510/8502 could run it.

I seriously doubt anyone will ever be able to reduce the LK-DOS to C—same for a lot of other "outer edge" C-64/C-128 programs written to squeeze the last drop of performance from the system. There is way more abstractness in it than can be imagined and despite heroic efforts by many brilliant people, machine intelligence is still at the level of a well-trained donkey.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Top
 Profile  
Reply with quote  
PostPosted: Wed Jul 25, 2012 3:38 pm 
Offline

Joined: Fri Jun 27, 2003 8:12 am
Posts: 618
Location: Meadowbrook
Am willing to bet this project is being put forth by a C programmer type who is not so familiar with the basics and nuances of assembly language and is merely seeking an easy reverse engineering by automate means. This would be impossible since programing is very much of an art with each programmer as style. An automated decompiler translator would not do so well, as BDD's example above shows nicely.

Miles: Thanks on bringing up Becky. did music for her Dragon Wars and Ultima. I owe my own programming skills to her.

And for my own bit of reverse engineering of software, will be translating over a slot machine program over to my pinball board (a vegas slot machine, not a video version) The original is written in 8039 but is well commented. I plan to use it as a jumping off point for understanding the various functions the original program does then write my own version.

_________________
"My biggest dream in life? Building black plywood Habitrails"


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 29 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 17 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: