CALLs take up extra resources in the RAS (the return-address predictor), which has a limited depth. The indirection is handled by the same circuitry, so, off the top of my head, prediction should be the same for an indirect JMP as for an indirect CALL (although you should always verify that with a correctly framed microbenchmark, which is not always easy to come up with). Not jacking the stack also gives you the ability to use it for other purposes - maybe keeping the 6502 return stack there cheaply, or storing some state along the way.
If you must CALL, what is really important is that every CALL's return address is not messed with ( http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/ ). JMP chaining helps by taking some load off the RAS and the micro-op buffer. Tail call elimination is an accepted practice - and a must in a real compiler ( https://fare.livejournal.com/142410.html - another reason not to take Python seriously).
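To make that concrete, here is a minimal sketch of JMP-chained dispatch, assuming Clang 13+'s musttail attribute; the VM layout and handler names are invented for illustration. Each handler ends in a guaranteed tail call, which compiles down to a plain indirect JMP, so the RAS never sees a CALL/RET pair:

    /* Sketch of JMP-chained dispatch via guaranteed tail calls (Clang 13+).
       VM layout and handler names are made up for illustration. */
    #include <stdint.h>

    typedef struct VM VM;
    typedef void Handler(VM *vm, const uint8_t *pc);
    struct VM { Handler *table[256]; uint8_t a; /* ... */ };

    static void dispatch(VM *vm, const uint8_t *pc) {
        /* a guaranteed tail call: emitted as an indirect JMP, not a CALL */
        __attribute__((musttail)) return vm->table[*pc](vm, pc);
    }

    static void op_lda_imm(VM *vm, const uint8_t *pc) {  /* LDA #imm */
        vm->a = pc[1];
        __attribute__((musttail)) return dispatch(vm, pc + 2);
    }

On GCC you can get the same shape out of computed goto (more on that below).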
Once you've hit a wall with direct interpretation of opcodes, your best gains will probably come from doubling up on commonly used pairs, like compare/branch combinations. Short of 64K handlers, selectively probing ahead may be a win for common combinations. Again, guessing, but a handler that sets, say, Z could probe the next opcode (cheaply, since it is in the cache - or even in a register, if you are cleverly reading a word or even a qword but dispatching on the low byte, etc.) to see if it's a branch. Perhaps a conditional set of the emulated PC, or something similar, avoids an actual host branch, or at least a separate conditional branch - and having that prediction separated from the end-of-handler dispatch may be a win (by spreading the prediction load).
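A guess at what such a fused handler could look like - the VM struct is invented and the flag handling is stripped down to the Z bit, but 0xC9/0xF0 are the real 6502 encodings of CMP #imm and BEQ:

    #include <stdint.h>

    typedef struct {
        uint8_t  a, z;        /* accumulator, emulated Z flag */
        uint16_t pc;
        uint8_t  mem[65536];
    } VM;

    /* CMP #imm handler that peeks at the next opcode and, if it is a BEQ,
       consumes it too, updating the emulated PC without a host branch. */
    static void op_cmp_imm(VM *vm) {
        uint8_t *mem = vm->mem;
        uint16_t pc  = vm->pc;
        uint8_t  z   = (vm->a == mem[pc + 1]);
        vm->z = z;
        pc += 2;
        if (mem[pc] == 0xF0) {                     /* next opcode is BEQ */
            int8_t   rel    = (int8_t)mem[pc + 1];
            uint16_t target = (uint16_t)(pc + 2 + rel);
            uint16_t fall   = (uint16_t)(pc + 2);
            pc = z ? target : fall;                /* likely a CMOV, not a branch */
        }
        vm->pc = pc;
    }

The emulated branch then costs no host conditional branch at all; the only host branch left is the peek on the next opcode, which for hot compare/branch pairs is strongly biased and should predict well.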
But of course, the important thing is to avoid a single dispatcher, even if it means your code grows by megabytes. Building on top of branch prediction is a pact with the Devil, and for interpreter writers there are consequences to dealing with its hidden state.
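For reference, the classic way to avoid the single dispatcher is GCC/Clang's labels-as-values (computed goto): replicate the indirect jump at the end of every handler, so each handler's exit gets its own slot of predictor state. A sketch, trimmed to two real 6502 opcodes (0xEA = NOP, 0x4C = JMP absolute), everything else illustrative:

    #include <stdint.h>

    void run(uint8_t *mem, uint16_t pc) {
        /* one entry per opcode; unhandled ones stop the interpreter */
        static void *table[256] = { [0 ... 255] = &&op_unimpl };
        table[0xEA] = &&op_nop;
        table[0x4C] = &&op_jmp_abs;

        goto *table[mem[pc]];

    op_nop:
        pc += 1;
        goto *table[mem[pc]];      /* dispatch copy #1 */

    op_jmp_abs:
        pc = (uint16_t)(mem[pc + 1] | (mem[pc + 2] << 8));
        goto *table[mem[pc]];      /* dispatch copy #2: separate hidden state */

    op_unimpl:
        return;
    }

Each goto-star site is a distinct indirect branch to the hardware, so the predictor can learn, say, "the jump after NOP usually goes to X" instead of averaging over every opcode transition in the program.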
[EDIT] Come to think of it, if you do go with 64K double-opcode dispatchers (as mentioned above, the actual number is probably in the low thousands, as many combinations hit literals or make no sense... although processing literals may be an optimization as well!), having a separate dispatcher for each, if it is big enough to pull in a new cache line, may also be sub-optimal. If you analyze the statistics of opcode pairs, there may be (largely) disjoint sets of common from-to pairs, and a handful of predictors centered on the statistical spikes may work better. Then you can just copy as many dispatchers as needed and jump to the one that statistically aggregates the instruction being exited/entered. This looks really nuts - each dispatcher is identical, and yet, due to internal hidden state, is better at handling some instructions than others!
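If I were to prototype that, it might be something like this toy heuristic (invented wholesale): profile opcode pairs, then have each opcode exit through the dispatcher copy chosen by its most frequent successor, so each copy's indirect jump stays biased toward a few targets:

    #include <stdint.h>

    enum { K = 4 };                          /* number of identical dispatcher copies */
    static uint32_t pair_count[256][256];    /* filled by a profiling run */
    static uint8_t  copy_for[256];           /* which copy opcode i exits through */

    void assign_dispatch_copies(void) {
        for (int from = 0; from < 256; from++) {
            /* find this opcode's dominant successor */
            int hot = 0;
            for (int to = 1; to < 256; to++)
                if (pair_count[from][to] > pair_count[from][hot])
                    hot = to;
            /* opcodes sharing a dominant successor share a dispatcher copy,
               keeping that copy's hidden predictor state coherent */
            copy_for[from] = (uint8_t)(hot % K);
        }
    }

Whether the predictor actually rewards any of this is exactly the kind of thing you would have to settle with a microbenchmark.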