6502.org • View topic - Code optimization !?

All times are UTC

Code optimization !?

Page 1 of 2

[ 24 posts ]

NotVeryBright

Post subject: Code optimization !?

Posted: Sat Aug 07, 2010 12:19 pm

Joined: Sat Aug 07, 2010 12:09 pm
Posts: 1

Hello all,
I have a small 8K binary file that runs a game on an old 6502 system. I am interested to know if anyone out there can disassemble, optimize and re-compile this file so it can make the best use of the up-rated W65C02S CPU? I am willing to pay for this service if anybody has an interest.
Regards, Sean.

BigEd

Post subject:

Posted: Sat Aug 07, 2010 1:01 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

If you want faster code, I would say the first step would be to profile it, which means you (or someone) need to get it running in an emulator, ideally an open-source one, unless you have access to a fancy logic analyser.

I expect the only realistic chance of getting any improvement is if there are one or two hotspots, and only then if the limited CPU differences happen to be useful.
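As a rough illustration of what profiling for hotspots could look like, here is a minimal sketch. It assumes the emulator can be made to log the program counter of every executed instruction; the `hotspots` helper, the 16-byte bucket size and the sample trace are all hypothetical and not taken from any particular emulator.

```python
from collections import Counter

def hotspots(pc_trace, top=3, bucket=16):
    """Group executed addresses into fixed-size buckets and rank the busiest.

    pc_trace is an iterable of program-counter values, one per executed
    instruction (assumed to come from an emulator's trace log).
    Returns a list of (bucket_start, count, share_of_total) tuples.
    """
    counts = Counter(pc - (pc % bucket) for pc in pc_trace)
    total = sum(counts.values())
    return [(addr, n, n / total) for addr, n in counts.most_common(top)]

# Hypothetical trace: a little start-up code, one tight loop, one small helper.
trace = [0xE000, 0xE003] + [0xE100, 0xE102, 0xE104] * 50 + [0xE200] * 10

for addr, n, share in hotspots(trace):
    print(f"${addr:04X}: {n} instructions ({share:.0%})")
```

If one or two buckets account for most of the executed instructions, those are the only places where re-coding is likely to pay off.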

Most likely, running on hardware which is clocked faster would be a better bet than re-coding anything. Even an emulator will probably run faster!

What kind of game is it, what kind of hardware does it run on, and can you make it available for download somewhere?

Cheers
Ed

kc5tja

Post subject:

Posted: Sat Aug 07, 2010 2:38 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

It should also be pointed out that games tend to be hard real-time programs. Changing instruction sequences can alter how long things take in the game, and that means game play itself usually ends up becoming too fast for a good experience, or even to play at all. This was something learned the hard way when games written for the IBM XT became unplayable on IBM ATs with "turbo" mode enabled. In fact, the whole reason turbo switches existed at all was to bring older games back down in speed.

Dr Jefyll

Post subject: Re: Code optimization !?

Posted: Sat Aug 07, 2010 3:54 pm

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada

NotVeryBright wrote:

[...] re-compile this file so it can make the best use of the up-rated W65C02S CPU

Are we correct in assuming your goal is to achieve faster performance? If so, kc5tja's comment is valid. Also, video and other system timing might be disrupted by a speed increase. But perhaps there's some other reason that the 65C02 interests you -- you haven't said.

If your objective is greater speed, how much faster do you hope to go? If the hotspots are re-written to exploit 'C02 capabilities you might achieve a modest performance boost (ballpark 20% to 50% maybe; without seeing the code it's hard to say).

The 65CE02 may be a better choice for you than a 65C02. These are no longer commercially available but there are still a few out there in the hands of individuals; in fact I have one myself. The 'CE02 is a fascinating chip with many advantages but in your case what's important is its ability to run existing, un-modified code in substantially fewer clock cycles than a 6502 or 65C02. Then, if you also choose to optimize the code, you will have two factors working in your favor.

[edit: minor trifles; lower estimate for speedup]

Last edited by Dr Jefyll on Tue Aug 10, 2010 5:58 pm, edited 1 time in total.

BigEd

Post subject:

Posted: Sat Aug 07, 2010 4:27 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

It's not entirely in the spirit of speeding up retro hardware, but note that an emulator running on a relatively modern PC can do 600MHz or more.

(I was timing NOPs emulated by lib6502 which is probably the least applicable benchmark, but perhaps gives an idea. I ran again with

Code:

STA (zp),Y

and got 260MHz on one machine, 800MHz on another.)

To the point about not necessarily wanting to speed up a realtime game: yes indeed, but for a turn-based game, whether a card game or a board game, a simple speedup might be useful or even able to strengthen the play.

Cheers
Ed

GARTHWILSON

Post subject:

Posted: Sat Aug 07, 2010 7:35 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California

Quote:

the whole reason turbo switches existed at all was to bring older games back down in speed.

Wow-- learn something new (or in this case, old) every day. I always thought they were for if you plugged in an older card or memory that couldn't keep up.

I have a 65CE02 or two but when I tried it as a drop-in replacement on my workbench computer, it didn't work. I never tried to figure out why. The 'CE02 had at least 30 op codes that took only one clock though, and eliminated a lot of other dead bus cycles, speeding things up even if you didn't alter the code or turn up the clock speed.

Maybe someone here knows if games were frequently written to use illegal op codes on the original NMOS 6502. I think GEOS did that to increase performance, but then it could not run on the CMOS 6502 which claimed those illegal op codes and gave them new functions.

Modern 6502's can run at much higher clock rates, but that doesn't mean the rest of your hardware can handle it. Since the 65802 is no longer available, we had kicked around the idea here of making a board with a 65816 on it that could be plugged into a 6502 socket, but we never did it. But with the '816, if you can really figure out what the program is doing and re-write it to take advantage of the 816's added capabilities, you can get a substantial speed-up at the same clock speed. My '816 Forth runs 2-3 times as fast as my '02 Forth at a given clock speed, and of course the clock can also be 20 times as fast as that of most of the popular 6502 home computers of yesteryear.

I would emphasize however that figuring out what a piece of software is doing, without having the source code, is a huge job. What comes to mind as I write this though is that some of those old games, if they are like a lot of very unprofessionally written, so-called "educational" software that was out at the time, might be written in BASIC. I never paid any attention to the games world to know.

kc5tja

Post subject:

Posted: Sun Aug 08, 2010 2:14 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

I received a private message indicating the game in question was an implementation of chess.

GARTHWILSON wrote:

Maybe someone here knows if games were frequently written to use illegal op codes on the original NMOS 6502. I think GEOS did that to increase performance, but then it could not run on the CMOS 6502 which claimed those illegal op codes and gave them new functions.

I think the motivating factor for GEOS wasn't performance, but rather compactness. GEOS consumed a lot of memory, and barely fit in the 16K space allotted to it in the Commodore 64 ($C000-$FFFF). Remember, not only did it fit a GUI, it also fit (in effect) a new DOS, the event loop architecture, device drivers for printer, mouse or joystick, the bitmapped display itself, disk storage, and if equipped, RAM-disk storage via external memory expansion.

A 65816-compatible implementation of GEOS now exists, called Wheels, IIRC. It's built for the SuperCPU.

Ruud

Post subject:

Posted: Sun Aug 08, 2010 7:53 am

Joined: Fri Dec 12, 2003 7:22 am
Posts: 259
Location: Heerlen, NL

GARTHWILSON wrote:

I always thought they were for if you plugged in an older card or memory that couldn't keep up.

These cards had so-called "wait state" jumpers. These could force the processor to add one (or even more) wait states to any instruction that accessed the address (range) of the card.
Wait states are still used by PCs. The faster your memory, the fewer wait states the CPU needs. But most people haven't any idea about this. And unfortunately lots of computer shops haven't either, I found out lately :(

_________________

Code:

    ___
   / __|__
  / /  |_/     Groetjes, Ruud 
  \ \__|_\
   \___|       URL: www.baltissen.org

BigEd

Post subject:

Posted: Sun Aug 08, 2010 12:10 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

I'd be happy to try to do a bit of analysis on the rom: it would be an interesting challenge to try to get a heat map. If you put a hex dump on pastebin, or put the ROM on dropbox, or anywhere else, I'll have a look.

The simplest speedup is of course clock speed, so trying to put some or all of the design into an FPGA would be a big win - not a trivial task, but probably easier than a full disassembly and rewrite. If the game doesn't have more than 16k of memory, I think it would fit on a 250k gate FPGA. I think running at 40MHz would be achievable, although I'm still something of a beginner.

Another tack would be to run in an emulator on some cheap embedded board - the mbed LPC1768 has a 96MHz ARM which should be a good platform for an efficient 6502 emulation(*), for £50 (USB powered, 40-pin 0.1 inch form factor, weird online-only toolchain)

Going back to the idea of tweaking the ROM and running on a 65C02 or 65CE02, the 65816 might also be worth considering - it's not too hard to put one on a daughtercard. But there's no point going in that direction until you've some idea of the performance-limiting parts of the ROM.

Cheers
Ed

(*) edit: might be worth a separate thread - see viewtopic.php?p=10962#10962

kc5tja

Post subject:

Posted: Sun Aug 08, 2010 2:55 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

Ruud wrote:

Wait states are still used by PCs. The faster your memory, the fewer wait states the CPU needs.

This is actually not strictly true; contemporary memories (yes, even static RAMs) are all synchronous in design nowadays, and the "speed" of a memory refers more to its bus speed than its access latency.

SxRAMs (where x=D for DRAM, S for SRAM) are built for cache-supporting CPUs, and transfer their memory in bursts ranging from 8 to 256 bytes at a time, configurable at system reset time (even before the CPU's reset is negated). The idea is to amortize the typical access latency across multiple bytes fetched or stored, thus effectively reducing net access times. Each byte (or word, in some cases) transfers in a single clock cycle, but you still have to wait for the RAM to load the row of bytes in the first place.

Note that actual data access times have not gone down over the years; all they're doing is amortizing it. So, really, the faster the CPU, the more wait states you'll need. I'll come back to this in a moment.

Also, SxRAMs support pipelining in many cases, turning what otherwise would be wait states into useful transactions.

For a single memory transaction, the client typically does the following:

1. Issues the row address.
2. Issues the column address.
3. Waits 4 to 8 cycles. (Double this for DDR SxRAMs.)
4. Transfers 8 to 256 bytes worth of data.

If reading in the same row, which is already open, you can skip a step:

1. (unused)
2. Issues a column address.
3. Waits 2 to 4 cycles.
4. Transfers 8 to 256 bytes.

There are things called pages and banks, which also have their own (usually longer) wait times.

If you exploit pipelining, you issue further addresses during the access time of earlier requests:

1. Issues row address 1.
2. Issues column address 1.
3. Issues row address 2 -- overlaps access time for chunk 1.
4. Issues column address 2.
5. Issues row address 3 -- overlaps access times for chunks 1 and 2.
6. Issues column address 3.
7. Waits maybe 2 to 4 cycles, if at all.
8. Receives 8 to 256 bytes for chunk 1.
9. Receives 8 to 256 bytes for chunk 2.
10. Receives 8 to 256 bytes for chunk 3.

and so on.
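The step lists above can be turned into a toy cycle count. This is only a sketch under simplifying assumptions that are not in the post: each address phase costs two cycles, one burst element transfers per cycle, every request pays the same wait, and there are no bank or page penalties. The function names are made up for illustration.

```python
def unpipelined_cycles(n, wait, burst, addr=2):
    """Cycles for n back-to-back transactions with no overlap.

    Each transaction pays its address phase, the full wait, and its burst.
    """
    return n * (addr + wait + burst)

def pipelined_cycles(n, wait, burst, addr=2):
    """Cycles when all addresses are issued up front, as in the list above.

    The wait for chunk 1 runs while the addresses of the later chunks are
    being issued, so only the part of the wait not yet covered remains.
    """
    address_phase = n * addr
    residual_wait = max(0, wait - addr * (n - 1))
    return address_phase + residual_wait + n * burst

# Three chunks, 8-cycle wait, 8-cycle bursts (assumed figures).
print(unpipelined_cycles(3, wait=8, burst=8))  # 54
print(pipelined_cycles(3, wait=8, burst=8))    # 34
```

With these assumed figures the residual wait in the pipelined case is 4 cycles, which is in line with the "2 to 4 cycles, if at all" in step 7.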

The chip protocols are pretty doggone complex, actually, which is why even the simplest of homebrew projects involving lots of RAM _require_ an FPGA at the very least to implement an SxRAM controller. But it's the price you pay if you want a memory system that works at near-CPU speeds at anything above 20MHz these days.

Concerning CPU speed and the number of wait states, the reason computer shops don't care about this anymore is that nobody cares anymore. How can you, when computers have a plurality of wait-state-inducing functions that dwarf the number of wait states on a memory chip? Consider: cache misses, TLB misses, branch misprediction, pipeline stalls in superscalar architectures, and more all contribute way more wasted cycles than memory access typically would. With today's 7-way integer execution units, to say that a cache miss costs the CPU several thousand instructions in performance is not outside the realm of possibility.

Assuming everything works well, though, and considering the size of most cache lines and what SxRAMs can deliver in a single burst, most access times are amortized into sub-cycle latencies. If you have an 8 cycle wait time and a 64-byte burst, you're looking at an average 0.125 wait cycle latency per byte fetched. Considering other performance-affecting factors in the CPU, this isn't even worth worrying about.
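The 0.125 figure is simply the wait divided by the burst length. A minimal sketch of that arithmetic follows; the function name is made up, and it deliberately ignores everything except the one wait and the one burst.

```python
def wait_cycles_per_byte(wait_cycles, burst_bytes):
    """Average wait cycles charged to each byte of a single burst."""
    return wait_cycles / burst_bytes

print(wait_cycles_per_byte(8, 64))  # 0.125, the figure quoted above
print(wait_cycles_per_byte(8, 1))   # 8.0, a lone byte pays the whole wait
```

Note that this is an average over the burst: the first byte still arrives only after the full 8-cycle wait, so a CPU that needs that byte before it can continue sees the whole latency.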

TMorita

Post subject:

Posted: Mon Aug 09, 2010 1:08 am

Joined: Sun Sep 15, 2002 10:42 pm
Posts: 214

kc5tja wrote:

...
Assuming everything works well, though, and considering the size of most cache lines and what SxRAMs can deliver in a single burst, most access times are amortized into sub-cycle latencies. If you have an 8 cycle wait time and a 64-byte burst, you're looking at an average 0.125 wait cycle latency per byte fetched. Considering other performance-affecting factors in the CPU, this isn't even worth worrying about.

I recommend you read Richard Sites' paper "It's the Memory, Stupid" before you embarrass yourself further.

Toshi

leeeeee

Post subject:

Posted: Mon Aug 09, 2010 2:16 am

Joined: Fri Aug 30, 2002 2:05 pm
Posts: 347
Location: UK

Quote:

I recommend you read Richard Sites' paper "It's the Memory, Stupid"

Got a link? It looks interesting but I can only find parts of it quoted in other papers.

Lee.

kc5tja

Post subject:

Posted: Mon Aug 09, 2010 2:49 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

Funny, I don't feel embarrassed. Sounds like you have a personal problem with my analysis.

BigEd

Post subject:

Posted: Mon Aug 09, 2010 9:45 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

leeeeee wrote:

Quote:

I recommend you read Richard Sites' paper "It's the Memory, Stupid"

Got a link? It looks interesting but I can only find parts of it quoted in other papers.

Lee.

The paper was published in "Microprocessor Report" (August 5 1996) - copyright and quite an expensive subscription.

This article has this to say:

Quote:

The 21364 processor, code-named the EV7, was described at the 1998 Microprocessor Forum; Figure 1 shows a die photo. The origins of the system design can be traced back to a 1996 column written by Alpha processor inventor Dick Sites entitled “It’s the Memory Stupid.” The article’s title was a play on a political catchphrase, but the message behind it was that memory bandwidth and latency—not just CPU core frequency—were the essential problems that needed to be solved in future generations of server processors. In that article, Sites related a performance analysis that showed that the 21164 core, running at 400MHz back then, was spending about three cycles out of every four just waiting for main memory. The problem that needed solving was not achieving faster core performance but keeping fast cores fed with data and instructions

I think perhaps Toshi's point is that Samuel implies that getting adequate bandwidth to memory by pipelining accesses is good enough, but in fact if the CPU is waiting for the data, it isn't.
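To put a number on the "three cycles out of every four" figure in the quote, here is a small worked example. It assumes, as a simplification that is not in the article, that the time spent waiting for memory stays fixed when the core gets faster; the function name is made up.

```python
def speedup_from_faster_core(stall_fraction, core_speedup):
    """Overall speedup when only the non-stalled part of the run gets faster.

    stall_fraction: share of run time spent waiting for memory (0..1),
                    assumed unchanged by the faster core.
    core_speedup:   factor by which the remaining compute time shrinks.
    """
    compute_fraction = 1.0 - stall_fraction
    return 1.0 / (stall_fraction + compute_fraction / core_speedup)

# Three cycles out of four spent waiting, core made twice as fast.
print(round(speedup_from_faster_core(0.75, 2.0), 3))  # 1.143
```

Under that assumption, even an infinitely fast core could not do better than 1/0.75, about 1.33 times the original speed, which is why latency matters and not just bandwidth.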

There's an interesting 2004 paper "Reflections on the memory wall", following on from and including a previous paper from 1994, noting that CPU clock rates continue to accelerate away from memory access times and discussing the implications.

As we're entirely and completely off-topic now, I thought it interesting that the recent "5 trillion digits of pi" calculation (using a single PC, for the most part) had this to say about RAM sizes and bus speeds:

Quote:

Due to time constraints, most of the algorithms that are used for disk operations had been "forcibly ported" from the in-ram equivalents with little or no regard to performance implications. Most of these performance penalties come in the form of sub-optimal cache use.

This sub-optimal cache use leads to excessive memory-thrashing which explains why Shigeru Kondo noticed that 96 GB of ram @ 1066 MHz had higher performance than 144 GB of ram @ 800 MHz

That's a lot of address bits...

fachat

Post subject:

Posted: Mon Aug 09, 2010 9:55 am

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany

GARTHWILSON wrote:

Modern 6502's can run at much higher clock rates, but that doesn't mean the rest of your hardware can handle it. Since the 65802 is no longer available, we had kicked around the idea here of making a board with a 65816 on it that could be plugged into a 6502 socket, but we never did it.

In case you didn't notice - I built such a card.
http://www.6502.org/users/andre/cbmhw/pet816/index.html
I am currently working on making it run in a 2MHz host at 2 MHz... (due to internal delays I'm not sure if I can get along without wait states in a 2MHz host)

André
