 Post subject: Kestrel 2 Emulation
PostPosted: Thu Feb 08, 2007 5:40 pm 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
I now have the firmware of the Kestrel 2 running with a Commodore-inspired full-screen interface (not bad!!), and it prints the following banner to the screen:

Code:
**** KESTREL FORTH ****

22080 Bytes free.

Ok.
_


This is, of course, exactly what I wanted. However, actually using it is painful. The keyboard handler constantly drops input events, and the CPU, which ought to run at 12.6MHz, only runs at 7 to 9MHz, depending entirely on what the rest of the computer is doing at the time (e.g., playing an MP3 results in 7MHz wall-clock performance).

Basic performance testing shows that the MGIA video emulation is the culprit here. It needs a thorough overhaul. But, the problem is, I don't know how to do this.

I researched VICE and other emulators, but the source code is opaque, and with no code walkthrough or even a basic conceptual-model document to go by, I'm woefully lost. I can't make heads or tails of it.

Now, the MGIA emulator literally emulates a sweeping electron beam in a fixed-frequency VGA monitor, so it is the ultimate in accuracy. Of course, because of this it's also the ultimate in slowness (the fact that I get any kind of usable performance from it fascinates me, though!!). It is so slow that the only way to get any kind of real performance is to busy-wait the software -- aside from SDL_NextEvent(), there are NO points in the code where the application can truly "wait" politely. Hence, it uses 100% CPU.

Does ANYONE here have any idea how to enhance video emulation performance? Once I conquer this, I believe that the rest of the emulator will just fall into place.

Yes, I know already that one should update only that which changes. But the MGIA's video fetch pointer is all it has. Intrinsically, it doesn't know where the beginning and end of a video bitmap are, and hence, neither does the emulator. This is why the software must reload this register on every vertical refresh. (Those familiar with Atari 8-bit computers or the Commodore Amiga will know exactly what I mean here.) This decision was made because it requires the fewest FPGA resources to implement.

Since there are Atari 800 emulators all over the place that don't suffer from the "work harder, not smarter" syndrome mine does, I know this is a solved problem. But after several weeks of researching emulator source code, I'm still just plain clueless about how to go about it.

HELP?!!


 Post subject:
PostPosted: Thu Feb 08, 2007 10:36 pm

Joined: Tue Dec 30, 2003 10:35 am
Posts: 42
I'm not sure exactly what the question is. If the issue is inefficiency from emulating each component one little step at a time before switching to another, then the "catch up" model can help a lot. The basic idea is to run one component without interruption for as long as possible. Things that require an interruption include interaction with another component or another component's interaction with the one running. As long as you can predict these interactions in advance for at least all but one component, then it works.

As an example, the graphics chip's behavior should be fairly easily predictable in advance, given that nothing else disturbs it. The CPU's behavior generally cannot be predicted in advance. So you run the CPU non-stop until it either interacts with another component, or another component would generate an interrupt. If the CPU interacts with the graphics chip, suspend CPU emulation just before the interaction, run the graphics chip to the present CPU time, update the prediction of its next effect on the CPU (if any), then do the interaction and continue CPU emulation. The same goes for an interrupt: you predict in advance when it will occur and use that time as the clock limit in the CPU emulator loop.

The result is substantial unbroken runs of emulation. This allows you to add optimizations for common bottleneck cases, like emulating a complete scanline without interruption. It also means that each component's emulation loop doesn't have to constantly poll other hardware, since the times of interaction will be known in advance. Most loops can have a single unchanging limit condition.
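
To sketch the shape of this (purely illustrative pseudo-C, not code from any particular emulator; every name in it is made up):

Code:
/* Illustrative sketch of the "catch up" model; all names are hypothetical. */

typedef unsigned long cycle_t;

extern cycle_t cpu_run_until( cycle_t limit );    /* runs CPU, returns time reached */
extern void    video_run( cycle_t cycles );       /* advances video chip state      */
extern cycle_t video_predict_next_event( void );  /* next IRQ/interaction time      */
extern int     frame_done( void );

static cycle_t cpu_time;    /* how far the CPU has been emulated      */
static cycle_t video_time;  /* how far the video chip has caught up   */

/* Bring the video chip up to the CPU's current time, in one big step.
   Called only when the CPU touches the chip or a predicted event comes due. */
static void video_catch_up( void )
{
    video_run( cpu_time - video_time );
    video_time = cpu_time;
}

void emulate_frame( void )
{
    while ( !frame_done() )
    {
        /* Run the CPU alone, uninterrupted, until the next predicted interaction. */
        cpu_time = cpu_run_until( video_predict_next_event() );

        video_catch_up();
    }
}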

Also, how the heck are you decoding an mp3 using just 2 MHz of time on a 65816?!? Is this stereo 16-bit 44/48 kHz?


 Post subject:
PostPosted: Thu Feb 08, 2007 11:09 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
First, let me start out by saying that what I'm writing is what's on my mind, and has not been 100% thought through. It sounds too harsh to me -- much harsher than what I'd normally write. I apologize in advance. BUT, if I don't get it down now, then I'll never have it available to refine later on (I'm currently at work with lots of managers floating about, so I have to be fast.)

I'd rather the data be here in potentially misunderstood/offensive form than not at all and gone forever. :)



Quote:
The basic idea is to run one component without interruption for as long as possible. Things that require an interruption include interaction with another component or another component's interaction with the one running. As long as you can predict these interactions in advance for at least all but one component, then it works.


Except you can't predict this. Ever.

The emulator has no knowledge of what my software is going to do. Nor when. Bugs could cause the video fetch pointer to be altered unpredictably, etc. Demo coders are *KNOWN* for doing things unpredictably.

Quote:
As an example, the graphics chip's behavior should be fairly easily predictable in advance, given that nothing else disturbs it. The CPU's


Except that, by its nature, the CPU is seemingly always disturbing it!!

Quote:
So you run the CPU non-stop until it either interacts with another component, or another component would generate an interrupt. If the CPU interacts with the graphics chip, suspend CPU emulation just before the interaction, run the graphics chip to the present CPU time,


There is a problem with this. To fetch data from RAM, the video chip has a DMA channel with an address register. The DMA channel's address register is the only thing that determines what data is fetched. And it only increments (to keep the hardware as simple as possible). It never resets to any base value under hardware control.

What this means is, if not touched by the CPU, that counter will eventually cycle through all 16MB of the CPU's address space.

So, what you're proposing is that whenever the CPU writes to RAM at an address >= the current video fetch pointer, it ought to kick off an MGIA catch-up. But what if it writes megabytes above the current video frame data? Every write access, then, would result in an MGIA catch-up.

Another issue: pretend that we disable (raster) interrupts, thus preventing the reload of the video fetch pointer, and enter into a tight loop:

Code:
   SEI        ; disable IRQs, so the video fetch pointer is never reloaded
a: JMP a      ; spin forever


This will hard-lock the computer. Yet, the display now should continuously reflect the state of the (now forever incrementing) video fetch pointer. Therefore, the CPU emulation code still needs to be interrupted at regular intervals to allow this to happen.

How can this be integrated into the system?

Quote:
. . .emulation loop doesn't have to constantly poll other hardware, since the times of interaction will be known in advance. Most loops can have a single unchanging limit condition.


But these interactions can never be known in advance. That is the whole problem!

Quote:
Also, how the heck are you decoding an mp3 using just 2 MHz of time on a 65816?!? Is this stereo 16-bit 44/48 kHz?


Uhh... huh? It's an emulator. It runs under Linux, and xmms is playing the MP3 on the same machine. The emulator is impacted by this.


 Post subject:
PostPosted: Fri Feb 09, 2007 1:17 am

Joined: Tue Dec 30, 2003 10:35 am
Posts: 42
Disclaimer noted. Maybe the lingering numbness from my dental visit today will help. :)

Here's my own disclaimer: my comments may not apply to this particular emulator, since I don't know the specifics. I only offer techniques which I know to work for some hardware, even in the face of user code doing "demo-style" things. This approach is also not trivial, but the speed it offers where it works is quite hard to beat. If at any point this approach seems to have become irrelevant to your original question, I'll drop the subject, since my only aim is useful ideas.

Quote:
The emulator has no knowledge of what my software is going to do. Nor when. Bugs could cause the video fetch pointer to be altered unpredictably, etc. Demo coders are *KNOWN* for doing things unpredictably.

By prediction, I mean what the hardware will do, not the user code. As I mentioned, you can't predict what the code will do, so the CPU has to be a sort of master in the scheme.
Quote:
Except that, by its nature, the CPU is seemingly always disturbing it!!

If the code is literally constantly writing to the video chip, what I described won't improve speed, but if it's like most code, doing some writes and then a lot of processing in between, it will help.
Quote:
But these interactions can never be known in advance. That is the whole problem!

Well yeah, if your system has components that defy prediction, or are very complex, then the "catch up" model won't work.

Not that this is necessarily a solution, but in the DMA example, you can still predict the range of addresses that the DMA pointer will cover in a given short time period (say, 1/60 second), then limit the CPU execution to this time period and take special action if any writes are made to the region the pointer will cover. If emulated memory accesses go through a table, you could simply change the entries corresponding to the DMA region, allowing no extra overhead unless the CPU is accessing that region. This is also an example of the general approach: find a simple way to handle the common case efficiently while still being able to handle the rare ones correctly.
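
To make the table idea concrete (a rough sketch only; the granularity, types and names here are all invented, not from your emulator or anyone else's):

Code:
/* Sketch: page-granular write dispatch. Pages the DMA pointer will sweep in
   the next 1/60 second get a slower handler; everything else stays fast.
   All names are hypothetical. */

#define PAGE_SIZE 4096
#define NUM_PAGES (16UL * 1024 * 1024 / PAGE_SIZE)    /* 16MB address space */

typedef void (*write_fn)( unsigned long addr, unsigned char val );

extern unsigned char ram [16UL * 1024 * 1024];
extern void mgia_catch_up( void );                    /* hypothetical */

static write_fn write_table [NUM_PAGES];

static void write_plain( unsigned long addr, unsigned char val )
{
    ram [addr] = val;                      /* common case: no side effects */
}

static void write_watched( unsigned long addr, unsigned char val )
{
    mgia_catch_up();                       /* render everything up to "now" */
    ram [addr] = val;
}

/* Once per 1/60 second: mark only the pages the DMA pointer will cover. */
void mark_dma_window( unsigned long start, unsigned long end )
{
    unsigned long p;
    for ( p = 0; p < NUM_PAGES; p++ )
        write_table [p] = write_plain;
    for ( p = start / PAGE_SIZE; p <= end / PAGE_SIZE; p++ )
        write_table [p] = write_watched;
}

/* The CPU's write path is a single indirect call either way. */
void cpu_write( unsigned long addr, unsigned char val )
{
    write_table [addr / PAGE_SIZE] ( addr, val );
}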

Quote:
[infinite loop with interrupts disabled] This will hard-lock the computer. Yet, the display now should continuously reflect the state of the (now forever incrementing) video fetch pointer. Therefore, the CPU emulation code still needs to be interrupted at regular intervals to allow this to happen.

Sure: one periodic event that interrupts CPU emulation (but doesn't otherwise affect the CPU) would be running the video chip every 1/60 second.

Quote:
uhh...huh? It's an emulator. It runs under Linux, and xmms (under Linux) is playing the MP3. The emulator is impacted by this.

Ohhh. The first paragraph threw me off, where you mentioned a keyboard handler dropping events, which I figured meant the handler in your Kestrel software. And the CPU speed I thought was a reference to some memory tolerances not being up to spec, so you had to lower your clock for stability. Heh. Makes sense now, it's all about the emulator's issues. Given how hard-core your design and coding sound from other messages you've posted, I couldn't rule out that maybe you really were decoding an mp3 somehow.


 Post subject:
PostPosted: Fri Feb 09, 2007 1:33 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Ahh, no. :)

The emulator is a tight, but fairly brute force, loop.

I see what you mean now. You're correct; the DMA fetch pointer will definitely allow us to provide a fixed window. Here's my logic, with a rough code sketch after the list -- see if it makes sense to you:

* Writing to any MGIA chip register (which necessarily includes the video data fetch pointer, VIDAPT) will force an MGIA catch-up, since all subsequent fetches will result in a new image being displayed. The emulator will also cache new values of VIDAPT in a background register, VIDAST (Video Data Start).

* Writing any memory in the range VIDAPT...VIDAST+38399 (yes, that's from the current data pointer up to the video data start plus the size of the largest bitmap the hardware supports) will force an MGIA catch-up.

* Changing the VRLI (VeRtical raster compare LIne) register necessarily alters when the next IRQ will occur, if it hasn't occurred already. Cancel any outstanding "upcoming events" related to this, and reschedule them accordingly.

* Even so, the vertical retrace event occurs, with or without 65816 notification. Ergo, it still must interrupt CPU emulation to take care of business. Part of this business is, in fact, updating VIDAST to the current value of VIDAPT, thus allowing it to properly emulate software that deliberately lets VIDAPT run over multiple bitmaps (a cheap way to get double buffering!).
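
In rough code, something like this (only a sketch of the rules above; VIDAPT, VRLI, the proposed VIDAST and the 38400-byte maximum bitmap come from the design, everything else is glue I'm inventing for illustration):

Code:
/* Sketch of the catch-up triggers above; helper names are invented.
   mgia and word32 come from the existing emulator source. */

#define MAX_BITMAP_BYTES 38400UL     /* largest bitmap the hardware supports */

void mgia_register_write( int reg, word32 value, word32 timestamp )
{
    mgia_catch_up( timestamp );           /* rule 1: any register write syncs */

    if( reg == REG_VIDAPT )
        mgia.VIDAST = value;              /* cache the new start-of-bitmap    */

    if( reg == REG_VRLI )
        reschedule_raster_irq( timestamp );    /* rule 3 */

    mgia_write_reg( reg, value );
}

void memory_write_hook( word32 addr, word32 timestamp )
{
    /* rule 2: only writes inside the window being scanned out matter */
    if( addr >= mgia.VIDAPT && addr < mgia.VIDAST + MAX_BITMAP_BYTES )
        mgia_catch_up( timestamp );
}

void vertical_retrace_event( word32 timestamp )
{
    /* rule 4: fires every frame whether or not the 65816 is told about it */
    mgia_catch_up( timestamp );
    mgia.VIDAST = mgia.VIDAPT;            /* track free-running double buffers */
}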

What do you think?

Another issue: suppose I make these changes, and I end up with a 4x to 6x improvement in emulation performance. While I'd like to have a 25MHz 65816 to play with, the reality is, the hardware will only run at 12.6MHz. :) What methods are there to properly throttle the emulator to control this?

Thanks for the feedback.


 Post subject:
PostPosted: Fri Feb 09, 2007 3:15 am

Joined: Tue Dec 30, 2003 10:35 am
Posts: 42
Quote:
Writing to any MGIA chip register (which necessarily includes the video data fetch pointer, VIDAPT) will force an MGIA catch-up, since all subsequent fetches will result in a new image being displayed. [...]
Writing any memory in the range VIDAPT...VIDAST+38399 [...] will force MGIA catch-up.

Exactly. Start out simple and treat any access of the chip as potentially changing its behavior, then make further distinctions if performance dictates.
Quote:
Even so, the vertical retrace event occurs, with or without 65816 notification. Ergo, it still must interrupt CPU emulation to take care of business.

Or it could do these internal side effects on the next catch-up. But the CPU interrupt has a snag: the I flag. If the interrupt requires CPU acknowledgement and the I flag is set when it comes due, the CPU must be sure to stop as soon as a CLI occurs (actually, one instruction after the CLI, at least on the 6502). But this is manageable, by having the CPU keep track of the time of the next IRQ as well as the time it must stop emulating no matter what.
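
In sketch form (invented names, nothing from a real core):

Code:
/* Sketch: the CPU core tracks two time limits. All names are hypothetical. */

typedef unsigned long cycle_t;

#define FLAG_I 0x04                 /* 65xx interrupt-disable bit */

extern cycle_t  cpu_time;
extern cycle_t  irq_time;           /* when the next IRQ comes due          */
extern cycle_t  end_time;           /* when emulation must stop regardless  */
extern unsigned status;             /* processor status byte                */
extern int      execute_one_instruction( void );
extern void     take_irq( void );

void cpu_run( void )
{
    while ( cpu_time < end_time )
    {
        /* Sample the I flag *before* executing, so an IRQ unmasked by CLI is
           taken only after the following instruction, roughly matching the
           6502's actual behavior. */
        int irq_allowed = !(status & FLAG_I);

        cpu_time += execute_one_instruction();

        if ( irq_allowed && cpu_time >= irq_time )
            take_irq();
    }
}
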
Quote:
What do you think?

You definitely get the idea. The question now is whether it's worth implementing, or whether simpler, more local optimizations can be made to the graphics code.
Quote:
While I'd like to have a 25MHz 65816 to play with, the reality is, the hardware will only run at 12.6MHz. What methods are there to properly throttle the emulator to control this?

Each emulated 1/60 second, when you update the host display, first wait until at least 1/60 of a host second has passed since the last update, or something equivalent. If there were no video display, you could still interrupt CPU emulation N times per emulated second and wait until the same amount of real time has passed. One really nice thing about a fast emulated CPU is the ability to quickly run thorough test code in the emulated environment. You could have it exhaustively test a function with every possible input value or whatever.
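
Here's a sketch of the throttling idea, using SDL 1.2's SDL_GetTicks()/SDL_Delay() since the emulator already uses SDL (the function itself is just illustrative):

Code:
#include <SDL.h>

/* Call once per emulated 1/60 second, right after updating the host display.
   Sleeping here also fixes the 100% CPU usage from busy-waiting. */
void throttle_frame( void )
{
    static Uint32 next_frame = 0;
    const  Uint32 frame_ms   = 1000 / 60;         /* ~16 ms per frame */
    Uint32 now = SDL_GetTicks();

    if ( next_frame == 0 )
        next_frame = now;                          /* first call: sync up */

    if ( now < next_frame )
        SDL_Delay( next_frame - now );             /* sleep off the excess */

    next_frame += frame_ms;

    if ( SDL_GetTicks() > next_frame + 100 )       /* fell badly behind */
        next_frame = SDL_GetTicks();               /* resync instead of racing */
}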


 Post subject:
PostPosted: Fri Feb 09, 2007 3:49 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
blargg wrote:
Or it could do these internal side-effects on the next catch-up.


Not possible -- if the CPU deadlocks, then the screen would never update. (Remember, the DMA pointer is free-wheeling without the vertical refresh interrupt!)

Quote:
But the CPU interrupt has a snag: the I flag.


Uh oh -- read again; I wrote interrupt CPU emulation. The CPU itself is not executing an interrupt service routine. That only happens on a successful raster compare event.

Quote:
If it requires CPU acknowledgement, then if the I flag is set when the interrupt occurs, the CPU must be sure to stop immediately if a CLI occurs (actually, it would stop one instruction after the CLI, at least the 6502 does). But this is manageable, by having the CPU keep track of the time of the next IRQ as well as the time it must stop emulating no matter what.


The CPU emulation core already deals with this. It's a modified core from the failed xgs project.


 Post subject:
PostPosted: Fri Feb 09, 2007 5:07 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Blargg,

I reworked update_mgia() to run only when the state of the chip or of the bitmap-to-be-displayed changes. The result is an emulator that actually runs SLOWER by a factor of 4. I cannot explain why this happens, but the impact was immense.

Blargg, it looks like you've worked on the atari800 emulator. I'd like to point out that the ANTIC chip kinda sorta works analogously to my MGIA. (Well, more accurately, it's closer to the Amiga's Agnus chip, but I digress; for this discussion they're the same). How does ANTIC's emulator work?

I tried reviewing the source code, but there are so many #ifdef blocks that I can't make heads or tails of it. I've been able to ascertain the following, though I can't be sure it's correct (I restate my understanding as a rough code sketch after the list):

1) At the start of each "frame" (I'm assuming an atari800 emulation frame is one scanline), the scanline-to-be-displayed is prefetched into a holding buffer, almost like how the Atari 7800's display list processor works. Note that fetching into this buffer always implies color expansion during the prefetch as well.

2) Also at the beginning of each frame, a "pixels displayed so far" counter, xpos, is reset to zero.

3) Each CPU clock cycle spans 1 pixel width on the screen; hence, xpos is incremented every emulated CPU cycle.

4) If any change to the video chip registers is made, the scanline is refetched starting from xpos and continuing to the end.

5) I'd like to think that any memory writes into the bitmap-yet-to-be-displayed, starting at xpos and continuing to the end of the current scanline, are mirrored automatically in the internal cache as well as RAM. (No need to bother with anything more, since it'll just be prefetched during the next ANTIC_frame() call anyway.) However, I don't see where this is actually implemented.

Then, at each scanline interval, the scanline cache is blitted into SDL's framebuffer. Every full video frame, SDL_Flip() is called to display it.
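
Restated in rough code, this is my guess at the model (made-up names, not actual antic.c code):

Code:
/* My guess at the antic.c model, restated; every name here is made up. */

extern unsigned char line_buffer [];
extern int  xpos;
extern int  cpu_execute_one( void );
extern void prefetch_and_expand_scanline( unsigned char *buf );
extern void blit_scanline( const unsigned char *buf );

void emulate_one_scanline( void )
{
    prefetch_and_expand_scanline( line_buffer );   /* step 1: color-expanded copy */
    xpos = 0;                                      /* step 2 */

    while ( xpos < CYCLES_PER_SCANLINE )           /* step 3: 1 cycle == 1 pixel */
    {
        xpos += cpu_execute_one();

        /* steps 4/5: a video register write (or a write into the part of the
           line still to be displayed) would refetch line_buffer from xpos on */
    }

    blit_scanline( line_buffer );                  /* copy into SDL framebuffer */
    /* ...and once per full frame, SDL_Flip() pushes it to the screen */
}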

Is this correct? Thanks for any clarifications you can offer.


 Post subject:
PostPosted: Fri Feb 09, 2007 6:15 pm

Joined: Tue Dec 30, 2003 10:35 am
Posts: 42
Quote:
I reworked update_mgia() to run only when the state of the chip or of the bitmap-to-be-displayed changes. The result is an emulator that actually runs SLOWER by a factor of 4. I cannot explain why this happens, but the impact was immense.

Very weird. Can I take a look at the source? I found the MGIA documentation in your Kestrel FIRMWARE.pdf document and I like the simplicity of the chip.

Quote:
it looks like you've worked on the atari800 emulator.

Actually I gave them code to emulate the composite NTSC video signal and TV decoder artifacts. I just looked at that ANTIC source and it's quite a beast.


 Post subject:
PostPosted: Fri Feb 09, 2007 7:35 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Although it's an "open source" package, I just haven't been bothered to put it up in a public place yet. :)

Let me bundle up what I have *now* -- it's broken in the sense that it's slower than it used to be.

Requirements: it runs on Linux under X11. A friend of mine got it to compile under Windows once, but never sent me the source patches, so I apologize if it won't compile out of the box for Windows.

Do not use VNC to run it and expect to use the keyboard -- the keycode mappings are totally different from what the firmware expects. It's likely this will happen with Windows too. However, it should run despite this -- you just won't be able to type anything meaningful.

NOTE: As a test of video performance, I have the firmware dumping the first 8K of ROM contents to the screen, before it displays its final banner.

You can grab the source archive here:

http://www.falvotech.com/tmp/k2.tar.gz


 Post subject:
PostPosted: Fri Feb 09, 2007 9:04 pm

Joined: Tue Dec 30, 2003 10:35 am
Posts: 42
I don't have an easy way to compile the code, but I can make a pretty good guess that most of the time is spent rendering the main scanline pixels. Caching unchanged scanlines shouldn't be necessary on any PC made in the last 10 years. Here are two optimizations. The first is a minor one to remove the redundant counters. You might also get better performance by making local copies of the most often used global variables, like x_counter, since the compiler doesn't know that these won't be modified by any of the functions called.

Code:
word32 elapsed_pixels = (timestamp-last_update) << 1;

if( elapsed_pixels >= pixelsPerScreen )
    elapsed_pixels = pixelsPerScreen;

word32 pixel_count = x_counter + elapsed_pixels;
while ( x_counter < pixel_count )
{
    ...
    x_counter++;

    if( x_counter >= mgia.CCHZTL )
    {
        pixel_count -= x_counter;
        x_counter = 0;
        mgia.CCVRLI++;
        ...
    }
}


This second optimization should be significant. It should result in optimal rendering for most active pixels in a scanline. I also show a further optimization for reading the source pixels, for what I assume is the common case: reading out of RAM without crossing a mirror boundary. Originally every pixel would require a run through the entire loop, checking many conditions over and over. Note that I don't have an easy way to test either of these, but you should get the idea behind them.

Code:
else if( mgia.CCVRLI < mgia.VIVREN )
{
    int count = min( mgia.VIHZEN - x_counter, mgia.CCHZTL - x_counter );
    if ( count >= 8 && bits_to_shift == 0 )
    {
        // Optimize common case of rendering 8 or more pixels when none are left
        // in internal shift register
       
        // making this static might reduce efficiency
        word32 colors [2] = { BLACK, WHITE };
       
        // Make local copy of beam pointer and tell compiler that writes to it
        // won't overlap with emulator memory
        word32* restrict out = virtual_beam;
       
        count &= ~7;
        x_counter += count;
        virtual_beam += count;
       
        // Even more optimal would be to get a pointer directly to emulator RAM,
        // since MEM_readMem() does a few checks and then calls another function.
        // This would require checking that the range VIDAPT to VIDAPT+count-1
        // is entirely in one mirror of RAM. If this doesn't hold, then the
        // unoptimized handler below should be used. Restrict tells compiler that
        // nothing else will modify data read via this pointer.
        //byte const* restrict in     = &ram [mgia.VIDAPT];
        //byte const*          in_end = in + count;
       
        word32 in     = mgia.VIDAPT;
        word32 in_end = in + count;
       
        mgia.VIDAPT += count;
     
        // render 8 pixels at a time
        do
        {
            //int t = shifter = *in++;
            int t = MEM_readMem( in++, timestamp );
           
            // write in reverse order to reduce shifting
            out [7] = colors [t & 1]; t >>= 1;
            out [6] = colors [t & 1]; t >>= 1;
            out [5] = colors [t & 1]; t >>= 1;
            out [4] = colors [t & 1]; t >>= 1;
            out [3] = colors [t & 1]; t >>= 1;
            out [2] = colors [t & 1]; t >>= 1;
            out [1] = colors [t & 1]; t >>= 1;
            out [0] = colors [t & 1];
            out += 8;
        }
        while ( in < in_end );
       
        virtual_beam = out;
    }
    else
    {
        if( bits_to_shift == 0 )
        {
            bits_to_shift = 8;
            mgia.unused0 = MEM_readMem( mgia.VIDAPT, timestamp );
            mgia.VIDAPT++;
        }

        if( mgia.unused0 & 0x80 ) { EMIT( WHITE ); }
        else                      { EMIT( BLACK ); }

        mgia.unused0 <<= 1;
        bits_to_shift--;
    }
}


 Post subject:
PostPosted: Fri Feb 09, 2007 10:09 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
I'm not so sure this will help, really -- it's still doing the color expansion on the fly, all the time, which means that mgia_update() will still consume the overwhelming majority of the emulator's time.

I think that "dirty line" support is still a requirement to get good performance.

I'll try it later tonight when I get home, but I'm not convinced at this point that it'll work in my favor.

Also, I'm not sure what you mean by "minimizing shifting" in your source code comment. Your loop isn't minimizing shifting, but it is minimizing the number of times decision logic has to be executed. I can definitely see how that could speed stuff up.

I should do an experiment to measure the impact of an if() statement in C code, and actually see if I can predict the speedup factor of minimizing conditional execution.

Another thing I was thinking of doing was using jump tables (well, tables of function pointers are the closest I can come in C), and using function pointers to maintain the MGIA's current state. This ought to further minimize conditional logic too.
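
Something along these lines (only a sketch of the idea; x_counter, mgia and word32 are from the existing code, the rest of the names are invented):

Code:
/* Sketch: one handler per MGIA beam state, so the hot path is a single
   indirect call instead of a chain of if()s. */

typedef void (*mgia_state_fn)( word32 pixels );

extern word32 x_counter;
extern word32 border_end;                          /* hypothetical */
extern void emit_border_pixels( word32 pixels );   /* hypothetical */
extern void render_active_pixels( word32 pixels ); /* the unrolled fast loop */

static void state_border( word32 pixels );
static void state_active( word32 pixels );

static mgia_state_fn current_state = state_border;

static void state_border( word32 pixels )
{
    emit_border_pixels( pixels );
    if( x_counter >= border_end )
        current_state = state_active;     /* state change replaces an if() */
}

static void state_active( word32 pixels )
{
    render_active_pixels( pixels );
    if( x_counter >= mgia.VIHZEN )
        current_state = state_border;
}

void mgia_update_chunk( word32 elapsed_pixels )
{
    current_state( elapsed_pixels );      /* one indirect call per chunk */
}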


 Post subject:
PostPosted: Fri Feb 09, 2007 11:48 pm

Joined: Tue Dec 30, 2003 10:35 am
Posts: 42
Quote:
I'm not so sure this will help, really -- it's still doing the color expansion on the fly, all the time.

You mean where it writes WHITE or BLACK to the output buffer? That conversion is pretty minimal, though you could reduce memory bandwidth by using an 8-bit buffer. This would eliminate the lookup too:

out [0] = t & 1;

What would help most is a profile to see which block in the update function is the most common. I'm assuming it's the one I provided an optimization for, as that seems to be the one that draws the main pixels. The optimized version should run lightning fast on a modern PC. What kind of system is this for, and is compiler optimization enabled for mgia.c?
Quote:
Also, I'm not sure what you mean by "minimizing shifting" in your source code command. Your loop isn't minimizing shifting,

I was comparing writing each group of 8 pixels "backwards" versus forwards for the brain-dead x86 architecture. I think the following code might involve more register copies or something on x86 (it would make no difference on a modern RISC CPU, like ARM, PowerPC, etc.). It's minor either way.
Code:
out [0] = colors [t >> 7 & 1];
out [1] = colors [t >> 6 & 1];
out [2] = colors [t >> 5 & 1];
out [3] = colors [t >> 4 & 1];
out [4] = colors [t >> 3 & 1];
out [5] = colors [t >> 2 & 1];
out [6] = colors [t >> 1 & 1];
out [7] = colors [t >> 0 & 1];

Quote:
I should do an experiment to measure the impact of an if() statement in C code, and actually see if I can predict the speedup factor of minimizing conditional execution.

Basically if it's generally taken or generally not taken, the processor will remember this and be able to eliminate its impact. In this case the branch would be taken 7/8 of the time, so it'd fare fairly well. But unrolling the loop as above allows more parallelization as well as eliminating the branch.


 Post subject:
PostPosted: Sat Feb 10, 2007 12:09 am

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Since the active display is the largest area of the screen, its code path will be the most frequently executed. The borders are, at best, 25% of what the MGIA has to do.

I was considering going to a 256-color display, as you outlined above, but I was concerned about SDL's own color-expansion capabilities. I guess it doesn't hurt to try it.
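
If I go that route, I'd guess the SDL 1.2 side looks something like this (just a sketch, assuming the 640x480 monochrome mode implied by the 38400-byte bitmap):

Code:
#include <SDL.h>

/* Sketch: an 8-bit paletted screen so the MGIA emulation writes one byte per
   pixel and SDL does the color expansion when blitting/flipping. */
static SDL_Surface *screen;

int init_video_8bpp( void )
{
    SDL_Color pal [2];

    screen = SDL_SetVideoMode( 640, 480, 8, SDL_SWSURFACE | SDL_HWPALETTE );
    if ( !screen )
        return -1;

    pal [0].r = pal [0].g = pal [0].b = 0;      /* index 0 = BLACK */
    pal [1].r = pal [1].g = pal [1].b = 255;    /* index 1 = WHITE */
    SDL_SetColors( screen, pal, 0, 2 );

    return 0;
}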

I still would really like to know how the atari800's antic.c does it.


 Post subject:
PostPosted: Sat Feb 10, 2007 1:05 am

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
kc5tja wrote:
Quote:
The basic idea is to run one component without interruption for as long as possible. Things that require an interruption include interaction with another component or another component's interaction with the one running. As long as you can predict these interactions in advance for at least all but one component, then it works.


Except you can't predict this. Ever.


Actually you can. VICE does it this way. It has an internal "alarm" feature that interrupts CPU emulation whenever a device needs to change its internal state, but in between the CPU can run undisturbed. Alarms are, for example, raster lines, VIA/CIA timer events, etc.

Quote:
The emulator has no knowledge of what my software is going to do. Nor when. Bugs could cause the video fetch pointer to be altered unpredictably, etc. Demo coders are *KNOWN* for doing things unpredictably.


When the CPU updates any of the registers (be it CRTC, VIA, CIA or anything else), the internal alarm queue (yes, actually a queue, not a single point-in-time value) is updated for the chip. For example, if the CPU sets a VIA timer to expire in 1000 cycles, the alarm is set to 1000 cycles from this point. If the CPU happens to modify the timer again after 900 cycles, then the alarm value for the VIA is updated and requeued into the alarm queue.
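
The shape of it is roughly this (not VICE's actual code, just a sketch with invented names):

Code:
/* Sketch of an alarm queue; not VICE's real implementation. */

typedef unsigned long clk_t;
typedef void (*alarm_fn)( clk_t now );

#define MAX_ALARMS 32
#define CLK_NEVER  (~0UL)

struct alarm { clk_t when; alarm_fn handler; };

static struct alarm queue [MAX_ALARMS];   /* kept sorted by 'when' */
static int          num_alarms;

extern void alarm_unset( alarm_fn handler );                       /* hypothetical */
extern void insert_sorted( struct alarm *q, int *count,
                           clk_t when, alarm_fn handler );         /* hypothetical */

/* (Re)queue a device's alarm, e.g. after the CPU rewrites a VIA timer:
   drop the old entry for this handler, insert the new expiry time. */
void alarm_set( alarm_fn handler, clk_t when )
{
    alarm_unset( handler );
    insert_sorted( queue, &num_alarms, when, handler );
}

/* The earliest pending alarm: the only value the CPU loop has to check. */
clk_t alarm_next( void )
{
    return num_alarms ? queue [0].when : CLK_NEVER;
}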

Quote:
Quote:
As an example, the graphics chip's behavior should be fairly easily predictable in advance, given that nothing else disturbs it. The CPU's


Except that, by its nature, the CPU is seemingly always disturbing it!!


What is graphics hardware worth if it keeps the CPU busy 100% of the time? That's, well, not efficient.

Quote:
Quote:
So you run the CPU non-stop until it either interacts with another component, or another component would generate an interrupt. If the CPU interacts with the graphics chip, suspend CPU emulation just before the interaction, run the graphics chip to the present CPU time,


There is a problem with this. To fetch data from RAM, the video chip has a DMA channel with an address register. The DMA channel's address register is the only thing that determines what data is fetched. And, it only increments (to keep the hardware the simplest possible). It never resets to any base value under hardware control.

What this means is, if not touched by the CPU, that counter will eventually cycle through all 16MB of the CPU's address space.

So, what you're proposing is that whenever the CPU writes to RAM at an address >= the current video fetch pointer, then it ought to kick off an MGIA catch-up. But, what if it writes it megabytes above the current video frame data? Every write access, then, would result in an MGIA catch up.


VICE for example uses (IIRC) a block-based pointer table for memory accesses. If the address is occupied by memory, reads and writes directly go to memory. Otherwise a special function is called (like "read_io(ADDRESS char)").
For your DMA, you could have block-based alarms (i.e. every time the DMA crosses a page boundary, trigger an alarm), and the alarm routine could replace the pointer to memory with a function pointer for only the page the DMA actually accesses, restoring the old one for the page it just left. The function could then automatically catch up with the DMA, but only when an access happens in that very page.
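
As a sketch, reusing the hypothetical page-table and alarm names from the sketches earlier in the thread (the constants are placeholders too):

Code:
/* Sketch: every time the DMA pointer crosses into a new page, this alarm
   fires, moves the "slow" write handler onto the new page, and restores the
   fast one on the page just left. */

static unsigned long dma_page;       /* page VIDAPT is currently sweeping */

static void dma_page_alarm( clk_t now )
{
    unsigned long new_page = mgia.VIDAPT / PAGE_SIZE;

    write_table [dma_page] = write_plain;      /* page just left: fast again    */
    write_table [new_page] = write_watched;    /* new page: catch up on a write */
    dma_page = new_page;

    /* next boundary crossing, assuming one byte fetched per DMA cycle */
    alarm_set( dma_page_alarm, now + PAGE_SIZE * CYCLES_PER_FETCH );
}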


Quote:
Another issue: pretend that we disable (raster) interrupts, thus preventing the reload of the video fetch pointer, and enter into a tight loop:

Code:
   STI
a: JMP a


This will hard-lock the computer. Yet, the display now should continuously reflect the state of the (now forever incrementing) video fetch pointer. Therefore, the CPU emulation code still needs to be interrupted at regular intervals to allow this to happen.

How can this be integrated into the system?


Exactly how VICE does it, for example. Every opcode fetch compares the current cycle time with the low end of the alarm queue, which is a single compare and quite fast. If the alarm happened during the last opcode (or is about to happen during the next one, I don't remember exactly), the alarm routine is called.

In fact each "write_*()" function has access to the maincpu clock value (the "maincpu_clk" variable in VICE represents the clock cycle of the opcode fetch, but the routine gets an offset to this as well) so that cycle-exact emulation can be done in the I/O access functions.
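
In sketch form (not the real VICE source; only maincpu_clk is a real VICE name, and alarm_next() is from my alarm sketch above):

Code:
/* Sketch of the main CPU loop check: one compare against the queue head. */
while ( running )
{
    if ( maincpu_clk >= alarm_next() )      /* single compare, quite fast    */
        alarm_dispatch( maincpu_clk );      /* run whatever handlers are due */

    maincpu_clk += execute_opcode();        /* otherwise run undisturbed     */
}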

Quote:
Quote:
. . .emulation loop doesn't have to constantly poll other hardware, since the times of interaction will be known in advance. Most loops can have a single unchanging limit condition.


But these interactions can never be known in advance. That is the whole problem!


Sorry, but only if those interactions come from "outside the system", e.g. a key press. But VICE, for example, has IIRC an alarm (or it does it during the raster-line alarm) where the host operating system is queried for new events.

Any other interaction, when triggered by a device (like a VIA), is predictable, and the point in time can be computed, so it can be checked in the CPU loop, which does the only non-predictable stuff.

Quote:
Quote:
Also, how the heck are you decoding an mp3 using just 2 MHz of time on a 65816?!? Is this stereo 16-bit 44/48 kHz?


uhh...huh? It's an emulator. It runs under Linux, and xmms (under Linux) is playing the MP3. The emulator is impacted by this.

Yes, I was wondering that too ;-)

André

P.S.: I used to be a core VICE developer. And it took a real while to make it one of the best emulators available, so don't worry. I will be preparing a VICE patch to emulate my CS/A65 systems as well when I have time.

I have to admit, the VICE code is difficult to understand when you're not "initiated", but if you have any questions, ask me.

