The important thing to remember is that each branch site is associated with its own prediction history, which can accommodate some limited set of possible branch targets and remember some temporal patterns. The common-sense central dispatcher is vastly inferior to inlining the dispatch code at the end of each handler. This allows each handler to have its own local history of where it may go next.
As BigEd pointed out, we may have 256 opcode handlers and 256 dispatchers inlined at the end of each, which allows each opcode to keep a local history of what other opcodes may follow. But if necessary, it can be taken deeper - we could have, for instance, 64K handlers - one per possible opcode pair (in reality, many will never occur - my guess is a couple of thousand active handlers) leading to the next opcode pair, which will eliminate half the guesses and make the other half much more likely. Obviously, we can generate these automatically.
Modern CPUs are tuned for compiled code; each branch site can handle a limited number of possible targets and temporal patterns. A pure bytecode interpreter is an unpredictable abomination in such a system - with its central dispatcher, the history buffer would have to remember the entire interpreted codebase in order to predict. Inlining the dispatcher moves it in the direction of compiled code on the interpreter-compiler spectrum - while still acting as an interpreter.
[EDIT] As Darek points out, it is a tragedy that there is no x86 instruction to prime or send hints to the branch-prediction subsystem. Personally, I always wanted an 'Interpret BYtecode' instruction, kind of like the old translate-byte instruction (x86 XLAT). IBY jumps through a fixed table of 256 pointers -- and the CPU is free to accelerate it by building deep prediction tables for the jumptable, taking into account the location of the executing IBY instruction site. Such a table could take a substantial percentage of CPU resources - there are not too many different bytecode interpreters running on any given CPU at any time. But the ability to virtualize bytecoded CPUs at gigacycles/sec would alter the computing landscape, bringing us closer to what computing should be (see http://ngnghm.github.io). Imagine being able to create domain-specific virtual CPUs (running at full speed) to solve problems, taking DSLs to their ultimate conclusion.
A MUST-READ for anyone attempting 'mulation
Re: A MUST-READ for anyone attempting 'mulation
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut
- commodorejohn
- Posts: 299
- Joined: 21 Jan 2016
- Location: Placerville, CA
- Contact:
Re: A MUST-READ for anyone attempting 'mulation
Okay, that makes a lot of sense. 'Course, the only architecture I've been writing *mulators for recently (Parallax Propeller) doesn't have branch prediction anyway...
Re: A MUST-READ for anyone attempting 'mulation
enso wrote:
But the ability to virtualize bytecoded CPUs at gigacycles/sec would alter the computing landscape, bringing us closer to what computing should be (see http://ngnghm.github.io). Imagine being able to create Domain-Specific virtual CPUs (running at full speed) to solve problems, taking DSLs to their ultimate conclusion.
Michael A.
DerTrueForce
- Posts: 483
- Joined: 04 Jun 2016
- Location: Australia
Re: A MUST-READ for anyone attempting 'mulation
OK, that makes more sense.
I suppose I could wrap the main dispatcher switch in a macro and stick that at the end of each instruction, but wouldn't that effectively turn into recursion without a base case, and blow the stack after a while? Crashing after a few dozen instructions doesn't strike me as being such a great idea.
Oh, wait... I think I designed Simulieren in a way that I simply can't implement this... Oops.
- commodorejohn
- Posts: 299
- Joined: 21 Jan 2016
- Location: Placerville, CA
- Contact:
Re: A MUST-READ for anyone attempting 'mulation
DerTrueForce wrote:
OK, that makes more sense.
I suppose I could wrap the main dispatcher switch in a macro and stick that at the end of each instruction, but wouldn't that effectively turn into recursion without a base case, and blow the stack after a while? Crashing after a few dozen instructions doesn't strike me as being such a great idea.
Re: A MUST-READ for anyone attempting 'mulation
Goto is the appropriate low-level thing, but if you think of each opcode handler as something you call, then tail-call optimisation is the thing which converts such calls into jumps.
DerTrueForce
- Posts: 483
- Joined: 04 Jun 2016
- Location: Australia
Re: A MUST-READ for anyone attempting 'mulation
The addressing mode resolvers and the operations themselves are standalone functions in Simulieren, with some exceptions. I used the usual approach of pulling common code out into subroutines.
The thing that really kills that approach, though, is that the way it's written now, it interprets one instruction, and then returns to whatever called it. It includes no loops that I can think of off the top of my head, and that's because I intended that it be used as a library, not a standalone program.
I could fairly easily alter it to make it do multiple instructions per call, and that might open up the possibility of this optimization.
The thing that really kills that approach, though, is that the way it's written now, it interprets one instruction, and then returns to whatever called it. It includes no loops that I can think of off the top of my head, and that's because I intended that it be used as a library, not a standalone program.
I could fairly easily alter it to make it do multiple instructions per call, and that might open up the possibility of this optimization.
Re: A MUST-READ for anyone attempting 'mulation
Is there something here that a peephole optimisation could help with? Once you have
Code: Select all
do some stuff
find next address
read next opcode
JSR indirect to function which handles that opcode
RTS
then you may be able to change the JSR to a JMP.
I do recommend a half hour reading the lib6502 source. It's very interesting, especially as it has conditional compilation for two different approaches on this particular tactic.
Re: A MUST-READ for anyone attempting 'mulation
CALLs take up extra resources in the RAS (return address stack predictor), which has a limited depth. The indirection is handled by the same circuitry, so off the top of my head, prediction should be the same for an indirect JMP or CALL (although you should always verify with a correctly framed microbenchmark, which is not always easy to come up with). Not jacking the stack also gives you the ability to use it for other purposes - maybe keeping the 6502 return stack cheaply, or storing some state along the way.
If you must CALL, what is really important is that every CALL has its return address left alone (http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/). JMP chaining helps by reducing the load on the RAS and the micro-op buffer a bit. Tail call elimination is an accepted practice - and a must in a real compiler (https://fare.livejournal.com/142410.html - another reason not to take Python seriously).
Once you've hit a wall with direct interpretation of opcodes, your best gains will probably come from doubling up on commonly used pairs, like compare/branch combinations. Short of 64K handlers, selectively probing ahead may be a win for common combinations. Again, guessing, but a handler that sets, say, Z could probe the next opcode (cheaply, since it is in the cache - or even in a register, if you are cleverly reading a word or even a qword but dispatching on the low byte) to see if it's a branch. Perhaps a conditional set of the emulated PC, or something similar, avoids an actual host branch - and having that prediction separated from the end-of-handler dispatch may be a win (by spreading the prediction load).
But of course, the important thing is to avoid a single dispatcher, even if it means your code grows by megabytes. Implementing branch prediction is a pact with the Devil, and dealing with its hidden state has consequences for interpreter writers.
[EDIT] Come to think of it, if you do go with 64K double-opcode dispatchers (as mentioned above, the actual number is probably in the low thousands, as many combinations hit literals or are simply not sensible... although processing literals may be an optimization as well), having a separate dispatcher for each, if it is big enough to pull in a new cache line, may also be suboptimal. If you analyze the statistics of opcode pairs, it is possible there are (largely) disjoint sets of common from-to pairs, and a handful of predictors centered around statistical spikes may work better. Then you can just copy as many dispatchers as needed and jump to the one that statistically aggregates the instruction being exited/entered. This looks really nuts - each dispatcher is identical, and yet, due to internal hidden state, is better at handling some instructions than others!
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut