6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sun Nov 24, 2024 9:05 am

All times are UTC




Post new topic Reply to topic  [ 24 posts ]  Go to page Previous  1, 2
Author Message
PostPosted: Sat Dec 07, 2019 7:56 pm 
Offline
User avatar

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 904
The important thing to remember is that each branch site is associated with its own prediction history, which can accommodate some limited set of possible branch targets and remember some temporal patterns. The common-sense central dispatcher is vastly inferior to inlining the dispatch code at the end of each handler. This allows each handler to have its own local history of where it may go next.

As BigEd pointed out, we may have 256 opcode handlers and 256 dispatchers inlined at the end of each, which allows each opcode to keep a local history of what other opcodes may follow. But if necessary, it can be taken deeper - we could have, for instance, 64K handlers - one per possible opcode pair (in reality, many will never occur - my guess is a couple of thousand active handlers) leading to the next opcode pair, which will eliminate half the guesses and make the other half much more likely. Obviously, we can generate these automatically.

Modern CPUs are tuned for compiled code; each branch site can handle a limited number of possible targets and temporal patterns. A pure bytecode interpreter is an unpredictable abomination in such a system - with its central dispatcher, the history buffer would have to remember the entire interpreted codebase in order to predict. Inlining the dispatcher moves it in the direction of compiled code on the interpreter-compiler spectrum - while still acting as an interpreter.

[EDIT] As Darek points out, it is a tragedy that there is no x86 instruction to prime or send hints to the branch-prediction subsystem. Personally, I always wanted an 'Interpret BYtecode' instruction, kind of like the old translate byte instruction. IBY jumps through a fixed table of 256 pointers -- and the CPU is free to accelerate it by building deep prediction tables for the jumptable, taking into account the location of the executing IBY instruction site. Such a table could take a substantial percentage of CPU resources - there are not too many different bytecode interpreters running on any given CPU at any time. But the ability to virtualize bytecoded CPUs at gigacycles/sec would alter the computing landscape, bringing us closer to what computing should be (see http://ngnghm.github.io). Imagine being able to create Domain-Specific virtual CPUs (running at full speed) to solve problems, taking DSLs to their ultimate conclusion.

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut


Top
 Profile  
Reply with quote  
PostPosted: Sat Dec 07, 2019 8:47 pm 
Offline

Joined: Thu Jan 21, 2016 7:33 pm
Posts: 282
Location: Placerville, CA
Okay, that makes a lot of sense. 'Course, the only architecture I've been writing *mulators for recently (Parallax Propeller) doesn't have branch prediction anyway...


Top
 Profile  
Reply with quote  
PostPosted: Sat Dec 07, 2019 10:09 pm 
Offline
User avatar

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
enso wrote:
But the ability to virtualize bytecoded CPUs at gigacycles/sec would alter the computing landscape, bringing us closer to what computing should be (see http://ngnghm.github.io). Imagine being able to create Domain-Specific virtual CPUs (running at full speed) to solve problems, taking DSLs to their ultimate conclusion.

The Burroughs B1700 was a significant attempt at satisfying the idea you expressed. Lack of speed for the microprogram store, the operating memory, and the arithmetic logic limited their performance. They could, however, simultaneously host several "interpreters".

_________________
Michael A.


Top
 Profile  
Reply with quote  
PostPosted: Sun Dec 08, 2019 5:05 am 
Offline

Joined: Sat Jun 04, 2016 10:22 pm
Posts: 483
Location: Australia
OK, that makes more sense.
I suppose I could wrap the main dispatcher switch in a macro and stick that at the end of each instruction, but wouldn't that effectively turn into recursion without a base case, and blow the stack after a while? Crashing after a few dozen insructions doesn't strike me as being such a great idea.

Oh, wait... I think I designed Simulieren in a way that I simply can't implement this... Oops.


Top
 Profile  
Reply with quote  
PostPosted: Sun Dec 08, 2019 5:24 am 
Offline

Joined: Thu Jan 21, 2016 7:33 pm
Posts: 282
Location: Placerville, CA
DerTrueForce wrote:
OK, that makes more sense.
I suppose I could wrap the main dispatcher switch in a macro and stick that at the end of each instruction, but wouldn't that effectively turn into recursion without a base case, and blow the stack after a while? Crashing after a few dozen insructions doesn't strike me as being such a great idea.

If your handlers are functions unto themselves, yeah. You'd need a language with a proper computed-GOTO facility to really implement such a thing properly, which standard C doesn't have (GCC adds some kind of extension for this, but of course then you're off in the weeds with proprietary stuff.)


Top
 Profile  
Reply with quote  
PostPosted: Sun Dec 08, 2019 8:09 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Goto is the appropriate low level thing, but if you think of each opcode handler as something you call, then tail call optimisation is the thing which converts such calls into jumps.


Top
 Profile  
Reply with quote  
PostPosted: Sun Dec 08, 2019 12:01 pm 
Offline

Joined: Sat Jun 04, 2016 10:22 pm
Posts: 483
Location: Australia
The addressing mode resolvers and the operations themselves are standalone functions in Simulieren, with some exceptions. I used the usual approach of pulling common code out into subroutines.

The thing that really kills that approach, though, is that the way it's written now, it interprets one instruction, and then returns to whatever called it. It includes no loops that I can think of off the top of my head, and that's because I intended that it be used as a library, not a standalone program.
I could fairly easily alter it to make it do multiple instructions per call, and that might open up the possibility of this optimization.


Top
 Profile  
Reply with quote  
PostPosted: Sun Dec 08, 2019 1:49 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Is there something here that a peephole optimisation could help with? Once you have
Code:
do some stuff
find next address
read next opcode
JSR indirect to function which handles that opcode
RTS
then you may be able to change the JSR to a JMP.

I do recommend a half hour reading the lib6502 source. It's very interesting, especially as it has conditional compilation for two different approaches on this particular tactic.


Top
 Profile  
Reply with quote  
PostPosted: Sun Dec 08, 2019 7:09 pm 
Offline
User avatar

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 904
CALLs take up extra resources of RAS (return predictor), which has a limited depth. The indirection is handled using the same circuitry, so off the top of my head, same prediction for indirect JMP or CALL (although you should always verify in a correctly framed microbenchmark, which is not always the easy to come up with). Not jacking the stack also gives you the ability to use it for other purposes - maybe keeping the 6502 return stack cheaply, or storing some state along the way.

If you must CALL, what is really important is that every CALL has its return address not messed with( http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/ ). JMP chaining is good by reducing the load on RAS and the microop buffer by a bit. Tail call elimination is an accepted practice - and a must in a real compiler (https://fare.livejournal.com/142410.html - another reason to not take Python seriously).

Once you've hit a wall with direct interpretation of opcodes, your best gains will probably come from doubling up on commonly used pairs, like compare/branch combinations. Short of 64K handlers, selective probing ahead may be a win for common combinations. Again, guessing, but a handler that sets, say, Z, could probe the next opcode (cheaply, since it is in the cache - or even in a register if you are clevely reading a word or even a qword but dispatching on low byte, etc.) to see if it's a branch. Perhaps there is a conditional set of the emulated PC or something that avoids an actual branch and or even a separate conditional branch - having said prediction be separated from the end-of-handler may be a win (by spreading the prediction load).

But of course, the important thing is to avoid a single dispatcher. Even if it means your code grows by megabytes. Choosing to implementing branch prediction is a pact with the Devil, and there are consequences of dealing with the hidden state for interpreter writers.

[EDIT] Come to think of it, if you do go with 64K double-opcode dispatchers (as mentioned above, the actual number is probably in the low thousands as many combinations hit literals or are not sensible... Although processing literals may be an optimization as well!)... Having a separate dispatcher for each, if it is big enough to pull in a new cache line, may be also sub-optimal. If you analyze the statistics of opcode pairs, it is possible there are (largely) disjointed sets of common from-to pairs, and a handful of predictors centering around statistical spikes may work better. Then, you can just copy as many dispatchers as needed and jump to the one that statistically aggregates the instruction being exited/entered. This looks really nuts - each dispatcher is identical, and yet, due to internal hidden state, is better for handling some instructions but not others!

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 24 posts ]  Go to page Previous  1, 2

All times are UTC


Who is online

Users browsing this forum: No registered users and 3 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: