6502.org Forum  Projects  Code  Documents  Tools  Forum

All times are UTC




Post new topic Reply to topic  [ 24 posts ]  Go to page Previous  1, 2
PostPosted: Sat Dec 07, 2019 7:56 pm 

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 899
The important thing to remember is that each branch site is associated with its own prediction history, which can accommodate some limited set of possible branch targets and remember some temporal patterns. The common-sense central dispatcher is vastly inferior to inlining the dispatch code at the end of each handler. This allows each handler to have its own local history of where it may go next.
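For contrast, here is the central-dispatcher shape being argued against, as a minimal C sketch (the three toy opcodes are invented for illustration): every handler funnels back through the one switch, so a single indirect-branch site has to predict all opcode-to-opcode transitions for the whole interpreted program.

```c
#include <assert.h>
#include <stdint.h>

/* Toy bytecode with a classic central dispatcher: ONE shared dispatch
   site (the switch) serves every handler, so the branch predictor sees
   every opcode-to-opcode transition mixed together in one history. */
enum { OP_INC, OP_DEC, OP_HALT };

static int run_central(const uint8_t *code) {
    int acc = 0;
    for (;;) {
        switch (*code++) {          /* the single shared dispatch site */
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
}
```

Running the program {OP_INC, OP_INC, OP_DEC, OP_HALT} through run_central returns 1; the inlined-dispatch variants below compute the same thing with one dispatch site per handler instead.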

As BigEd pointed out, we may have 256 opcode handlers and 256 dispatchers inlined at the end of each, which allows each opcode to keep a local history of what other opcodes may follow. But if necessary, it can be taken deeper - we could have, for instance, 64K handlers - one per possible opcode pair (in reality, many will never occur - my guess is a couple of thousand active handlers) leading to the next opcode pair, which will eliminate half the guesses and make the other half much more likely. Obviously, we can generate these automatically.

Modern CPUs are tuned for compiled code; each branch site can handle a limited number of possible targets and temporal patterns. A pure bytecode interpreter is an unpredictable abomination in such a system - with its central dispatcher, the history buffer would have to remember the entire interpreted codebase in order to predict. Inlining the dispatcher moves it in the direction of compiled code on the interpreter-compiler spectrum - while still acting as an interpreter.

[EDIT] As Darek points out, it is a tragedy that there is no x86 instruction to prime or send hints to the branch-prediction subsystem. Personally, I always wanted an 'Interpret BYtecode' instruction, kind of like the old translate-byte instruction (XLAT). IBY would jump through a fixed table of 256 pointers -- and the CPU would be free to accelerate it by building deep prediction tables for the jump table, taking the location of the executing IBY instruction site into account. Such a table could take a substantial percentage of CPU resources - there are not too many different bytecode interpreters running on any given CPU at any time. But the ability to virtualize bytecoded CPUs at gigacycles/sec would alter the computing landscape, bringing us closer to what computing should be (see http://ngnghm.github.io). Imagine being able to create domain-specific virtual CPUs (running at full speed) to solve problems, taking DSLs to their ultimate conclusion.

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut


PostPosted: Sat Dec 07, 2019 8:47 pm 

Joined: Thu Jan 21, 2016 7:33 pm
Posts: 269
Location: Placerville, CA
Okay, that makes a lot of sense. 'Course, the only architecture I've been writing *mulators for recently (Parallax Propeller) doesn't have branch prediction anyway...


PostPosted: Sat Dec 07, 2019 10:09 pm 

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
enso wrote:
But the ability to virtualize bytecoded CPUs at gigacycles/sec would alter the computing landscape, bringing us closer to what computing should be (see http://ngnghm.github.io). Imagine being able to create Domain-Specific virtual CPUs (running at full speed) to solve problems, taking DSLs to their ultimate conclusion.

The Burroughs B1700 was a significant attempt at satisfying the idea you expressed. Lack of speed for the microprogram store, the operating memory, and the arithmetic logic limited their performance. They could, however, simultaneously host several "interpreters".

_________________
Michael A.


PostPosted: Sun Dec 08, 2019 5:05 am 

Joined: Sat Jun 04, 2016 10:22 pm
Posts: 483
Location: Australia
OK, that makes more sense.
I suppose I could wrap the main dispatcher switch in a macro and stick that at the end of each instruction, but wouldn't that effectively turn into recursion without a base case, and blow the stack after a while? Crashing after a few dozen instructions doesn't strike me as such a great idea.

Oh, wait... I think I designed Simulieren in a way that I simply can't implement this... Oops.


PostPosted: Sun Dec 08, 2019 5:24 am 

Joined: Thu Jan 21, 2016 7:33 pm
Posts: 269
Location: Placerville, CA
DerTrueForce wrote:
OK, that makes more sense.
I suppose I could wrap the main dispatcher switch in a macro and stick that at the end of each instruction, but wouldn't that effectively turn into recursion without a base case, and blow the stack after a while? Crashing after a few dozen instructions doesn't strike me as such a great idea.

If your handlers are functions unto themselves, yeah. You'd need a language with a proper computed-GOTO facility to implement such a thing cleanly, which standard C doesn't have (GCC's "labels as values" extension provides one, but of course then you're off in the weeds with non-portable stuff.)
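A minimal sketch of that GCC/Clang extension ("labels as values" - non-standard C), using the same invented toy opcodes as above: the dispatch macro is expanded at the end of every handler, so each handler gets its own indirect-branch site with its own prediction history.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_INC, OP_DEC, OP_HALT };

static int run_threaded(const uint8_t *code) {
    /* GNU C "labels as values": &&label takes a label's address and
       goto *p jumps to it. Each DISPATCH() expansion below is a
       DISTINCT indirect branch, unlike a single shared switch. */
    void *const table[] = { &&do_inc, &&do_dec, &&do_halt };
    int acc = 0;
#define DISPATCH() goto *table[*code++]
    DISPATCH();
do_inc:  acc++; DISPATCH();
do_dec:  acc--; DISPATCH();
do_halt:
#undef DISPATCH
    return acc;
}
```

The handler bodies are trivial here; in a real emulator each label would carry a full opcode implementation, with the same one-line dispatch macro at the end of each.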


PostPosted: Sun Dec 08, 2019 8:09 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Goto is the appropriate low-level thing, but if you think of each opcode handler as something you call, then tail-call optimisation is what converts such calls into jumps.
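A sketch of that view in standard C, with hypothetical toy opcodes: each handler's last action is calling the next handler through a table. A tail-call-optimising compiler (GCC/Clang at -O2, or Clang's __attribute__((musttail)) for a guarantee) turns each of those calls into an indirect jump; without TCO, every interpreted instruction burns a stack frame, which is exactly the blow-the-stack worry above.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_INC, OP_DEC, OP_HALT };

typedef int handler_fn(const uint8_t *code, int acc);
static handler_fn op_inc, op_dec, op_halt;          /* forward declarations */
static handler_fn *const table[] = { op_inc, op_dec, op_halt };

/* Each handler tail-calls the next one through the table, so the
   dispatch site lives at the END of each handler - one per handler. */
static int op_inc(const uint8_t *c, int acc)  { return table[*c](c + 1, acc + 1); }
static int op_dec(const uint8_t *c, int acc)  { return table[*c](c + 1, acc - 1); }
static int op_halt(const uint8_t *c, int acc) { (void)c; return acc; }

static int run_tailcall(const uint8_t *code) {
    return table[*code](code + 1, 0);
}
```

Whether the calls actually become jumps is up to the compiler and flags; that is the gap Clang's musttail attribute was added to close.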


PostPosted: Sun Dec 08, 2019 12:01 pm 

Joined: Sat Jun 04, 2016 10:22 pm
Posts: 483
Location: Australia
The addressing mode resolvers and the operations themselves are standalone functions in Simulieren, with some exceptions. I used the usual approach of pulling common code out into subroutines.

The thing that really kills that approach, though, is that the way it's written now, it interprets one instruction, and then returns to whatever called it. It includes no loops that I can think of off the top of my head, and that's because I intended that it be used as a library, not a standalone program.
I could fairly easily alter it to make it do multiple instructions per call, and that might open up the possibility of this optimization.
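One possible shape for that change, sketched with invented names (not Simulieren's actual API): keep the one-instruction entry point for library users, and add a batched entry point that loops until a count is exhausted or the core halts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state struct and opcodes, for illustration only.
   step() interprets one instruction and returns, as the library does
   now; run_n() batches several instructions per call, which is the
   shape needed before inlined dispatch becomes worth doing. */
enum { OP_INC, OP_DEC, OP_HALT };

typedef struct { const uint8_t *pc; int acc; int halted; } Vm;

static void step(Vm *vm) {                 /* one instruction per call */
    switch (*vm->pc++) {
    case OP_INC:  vm->acc++; break;
    case OP_DEC:  vm->acc--; break;
    case OP_HALT: vm->halted = 1; break;
    }
}

static void run_n(Vm *vm, size_t n) {      /* batched entry point */
    while (n-- && !vm->halted)
        step(vm);
}
```

Callers that want the old behaviour just pass n = 1, so the library interface survives the change.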


PostPosted: Sun Dec 08, 2019 1:49 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Is there something here that a peephole optimisation could help with? Once you have
Code:
do some stuff
find next address
read next opcode
JSR indirect to function which handles that opcode
RTS
then you may be able to change the JSR to a JMP.

I do recommend a half hour reading the lib6502 source. It's very interesting, especially as it has conditional compilation for two different approaches on this particular tactic.


PostPosted: Sun Dec 08, 2019 7:09 pm 

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 899
CALLs take up extra resources in the RAS (the return-address-stack predictor), which has a limited depth. The indirection is handled by the same circuitry, so off the top of my head you get the same prediction for an indirect JMP as for an indirect CALL (although you should always verify with a correctly framed microbenchmark, which is not always easy to come up with). Not touching the machine stack also frees it for other purposes - maybe keeping the 6502 return stack cheaply, or storing some state along the way.

If you must CALL, what really matters is that every CALL's return address is left untouched (http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/). JMP chaining is good in that it reduces the load on the RAS and the microop buffer a bit. Tail-call elimination is an accepted practice - and a must in a real compiler (https://fare.livejournal.com/142410.html - another reason not to take Python seriously).

Once you've hit a wall with direct interpretation of opcodes, your best gains will probably come from doubling up on commonly used pairs, like compare/branch combinations. Short of 64K handlers, selectively probing ahead may be a win for common combinations. Again, guessing, but a handler that sets, say, Z could probe the next opcode (cheaply, since it is in the cache - or even in a register, if you are cleverly reading a word or even a qword but dispatching on the low byte) to see if it's a branch. Perhaps a conditional set of the emulated PC could avoid an actual host branch entirely - and having that prediction separated from the end-of-handler dispatch may be a win (by spreading the prediction load).
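A toy illustration of the probe-ahead idea (opcodes and encoding invented for this sketch): the decrement handler peeks at the next byte and, when it sees a conditional branch, consumes both instructions in a single dispatch, halving the number of indirect branches on the hot loop.

```c
#include <assert.h>
#include <stdint.h>

/* Invented opcodes: OP_DEC decrements, OP_JNZ takes a signed 8-bit
   offset (relative to the byte after the offset) if acc != 0.
   run_fused() returns the number of dispatches, to show the saving. */
enum { OP_DEC, OP_JNZ, OP_HALT };

static int run_fused(const uint8_t *code) {
    const uint8_t *pc = code;
    int acc = 3, dispatches = 0;
    for (;;) {
        dispatches++;
        switch (*pc++) {
        case OP_DEC:
            acc--;
            if (*pc == OP_JNZ) {           /* probe ahead and fuse */
                int8_t off = (int8_t)pc[1];
                pc += 2;                   /* consume opcode + offset */
                if (acc != 0) pc += off;
            }
            break;
        case OP_JNZ: {                     /* unfused fallback path */
            int8_t off = (int8_t)*pc++;
            if (acc != 0) pc += off;
            break;
        }
        case OP_HALT:
            return dispatches;
        }
    }
}
```

For the loop {OP_DEC, OP_JNZ, -3, OP_HALT} starting from acc = 3, the fused interpreter needs only 4 dispatches (three fused DEC+JNZ, one HALT) where the naive version would need 7.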

But of course, the important thing is to avoid a single dispatcher, even if it means your code grows by megabytes. Branch prediction is a pact with the Devil, and its hidden state has consequences that interpreter writers have to live with.

[EDIT] Come to think of it, if you do go with 64K double-opcode dispatchers (as mentioned above, the actual number is probably in the low thousands, since many combinations hit literals or are simply not sensible... although processing literals may be an optimization as well!), having a separate dispatcher for each, if it is big enough to pull in a new cache line, may also be sub-optimal. If you analyze the statistics of opcode pairs, there may be (largely) disjoint sets of common from-to pairs, and a handful of predictors centered on the statistical spikes may work better. Then you can just copy as many dispatchers as needed and jump to the one that statistically aggregates the instructions being exited/entered. This looks really nuts - each dispatcher is identical, and yet, due to internal hidden state, is better at handling some instructions than others!

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut

