I think I found it!
Boy, it sure feels right, but I’ve yet to be able to confirm. I wanted to share this for a sanity test now before I go to patch the boards. Here it goes:
The problem in question is that the game IK+ hangs after a few seconds when using the TTL CPU in the C64. At the time of failure, the CPU is caught in a tight IRQ loop, with /IRQ held low. The likely suspect is some side-effect (like the dead-cycle write of an INC instruction) being used to clear the IRQ, and the TTL CPU failing to replicate this side-effect correctly. But closer inspection of the code revealed something different: the final execution of the IRQ service routine before the hang uses an invalid vector address, which lands the CPU in no man’s land. That’s why /IRQ goes flat. Tracing the game logic back, I can see that the IRQ vector is changed dynamically, and that it is finally corrupted by, of all things, a lost NMI!
More precisely put, the failure to execute an NMI prevents the reset of a critical variable, one which is used by the game to select the IRQ vector low-byte. A subsequent IRQ then increments this variable beyond its limit, and uses it to set the next IRQ vector incorrectly — with terminal consequences. The root cause is the missed NMI. Here is the critical moment:
[Attachment: Cap NMI w IRQ zoom.png]
The capture shows that /NMI goes low while an IRQ is in progress, and it stays low even after the interrupt is cleared and /IRQ returns high. In fact, /NMI never goes high again. The three write cycles of the BRK sequence for the IRQ are clearly evident from R/W, but there is no sign of the NMI being serviced. It sure looks like the NMI interrupt is lost.
Sure enough, a look at the TTL CPU’s internal logic confirms that an NMI will be ignored if it arrives after the fetch of the interrupt vector low-byte, but before the end of the SYNC cycle at the end of the BRK sequence. This is exactly the time at which the NMI happens to arrive in the capture above. In fact, the chances of an NMI hitting this vulnerable window are enhanced in this case by the action of RDY (shown on the capture beginning at the cursor). It happens to go low to pause the CPU exactly during the critical SYNC cycle, extending the window of vulnerability over many cycles.
OK, so what to do? Obviously we need to close the window, but this raises the obvious question: how does the NMOS 6502 react to overlapping IRQ and NMI events? Dieter shared with me some fascinating info on visual6502.org in that regard. In particular, three points stand out:
- A “late” NMI, one which arrives after the BRK sequence has begun but before the interrupt vector low-byte has been fetched, hijacks the IRQ. In this case, the NMI vector is followed, and the IRQ will execute after the NMI is completed.
- A “later” NMI, one which arrives after the fetch of the vector low-byte but before the ISR has started, will interrupt the second instruction of the ISR.
- A “lost” NMI, one which arrives less than a 1/2 cycle before the fetch of the vector low-byte, is ignored and the NMI is lost.
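For anyone who wants to poke at these three cases, here is how I would write them down as a small classifier. It is only a sketch of the behaviour as described in the list above, not a cycle-exact model of the NMOS 6502. The function name, the time units (CPU cycles as floats on a common timeline) and the example timings are all my own assumptions for illustration.

```python
def classify_nmi(t_nmi, t_seq_start, t_low_fetch, t_isr_start):
    """Classify an NMI edge that overlaps the BRK sequence of an IRQ.

    All times are in CPU cycles on a common (hypothetical) timeline:
      t_seq_start  - the BRK sequence for the IRQ begins
      t_low_fetch  - the interrupt vector low-byte is fetched
      t_isr_start  - the first instruction of the ISR begins
    """
    if t_nmi < t_seq_start or t_nmi >= t_isr_start:
        return "outside"   # not one of the overlap cases in the list
    if t_nmi >= t_low_fetch:
        return "later"     # NMI is serviced early in the ISR
    if t_nmi > t_low_fetch - 0.5:
        return "lost"      # less than 1/2 cycle before the low-byte fetch
    return "late"          # hijacks the IRQ; the NMI vector is followed


# Illustrative timings only: sequence starts at 0, low-byte fetch at 5, ISR at 7.
for t in (2.0, 4.8, 5.5):
    print(t, classify_nmi(t, 0.0, 5.0, 7.0))
```

The boundaries are passed in rather than hard-coded because the exact sampling instants are precisely what is in question here.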
The TTL CPU already reproduces #1 above: an NMI arriving before the fetch of the vector low-byte will hijack an IRQ already in progress. That’s great, but unfortunately, the other two behaviours are not reproduced correctly. The problem is that the TTL CPU clears the internal flags for all pending interrupts at the end of the BRK sequence. Admittedly, that’s a bit of a blunt instrument, but pretty much ok for /IRQ and /RES. Both these signals are level-sensitive and will simply be captured once again as soon as the BRK sequence completes.
Of course that’s not true for NMI, which is edge-sensitive. In that case, the pending interrupt flag must not be cleared unless the NMI interrupt is actually being serviced. Now, we know NMI will hijack the interrupt if it arrives before the vector low-byte fetch, so we can be sure we are indeed clearing only a serviced NMI if the flags are cleared immediately after the low-byte fetch (i.e. upon the fall of PHI2 in that cycle). The internal pending-interrupt flags are used only to generate the correct vector low-byte address, so we don’t really need them once that job is done. So far as I can tell, the fix should therefore deal with the problem without breaking anything.
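To check the reasoning, here is a toy model of the pending-NMI flag under the two clearing policies. It works at whole-cycle granularity only, so it ignores sub-cycle timing entirely. The cycle numbering is my assumption for illustration: a 7-cycle BRK sequence with the vector low-byte fetched in cycle 6, and the following SYNC cycle counted as cycle 8. The function names are made up, and this is not a description of the actual TTL logic.

```python
def run_brk_sequence(nmi_cycle, clear_cycle, low_fetch_cycle=6):
    """Step through the BRK sequence one cycle at a time.

    Returns (vector, nmi_pending): the vector chosen at the low-byte fetch,
    and whether the NMI flag is still set once all events have played out.
    """
    nmi_pending = False
    vector = None
    last_cycle = max(nmi_cycle, clear_cycle, low_fetch_cycle)
    for cycle in range(1, last_cycle + 1):
        if cycle == nmi_cycle:
            nmi_pending = True                       # falling edge on /NMI is latched
        if cycle == low_fetch_cycle:
            vector = "NMI" if nmi_pending else "IRQ"  # flag selects the vector
        if cycle == clear_cycle:
            nmi_pending = False                      # pending flags are cleared
    return vector, nmi_pending


def nmi_lost(nmi_cycle, clear_cycle):
    """An NMI is lost if it was neither serviced nor left pending."""
    vector, nmi_pending = run_brk_sequence(nmi_cycle, clear_cycle)
    return vector != "NMI" and not nmi_pending


OLD_CLEAR = 8  # assumed: end of the SYNC cycle that follows the sequence
NEW_CLEAR = 6  # proposed: immediately after the vector low-byte fetch

for nmi_cycle in (3, 7):
    print(nmi_cycle, nmi_lost(nmi_cycle, OLD_CLEAR), nmi_lost(nmi_cycle, NEW_CLEAR))
```

In this model an NMI edge in cycle 7 is dropped when the flags are cleared at the end of the sequence, but survives as a pending interrupt when they are cleared right after the low-byte fetch. An edge in cycle 3 selects the NMI vector either way.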
But even with that, a small window still remains while the internal flags are actually being cleared. A low-going ~15ns pulse is used to clear the internal flip-flops, and an NMI will be lost if it is received exactly within that 15ns window. Now this is no worse than the NMOS 6502, which reportedly loses NMIs that land within a 1/2 cycle of the vector low-byte fetch. Just to confirm, I left the IK+ game running in the opening animation sequence for an extended period — something I had not done to this point. If in fact the 6510 in the C64 loses NMI interrupts, then it should be only a matter of time before it too hangs, right?
Well, lo and behold, yes — indeed the game hangs after a few minutes with NMI stuck low. I repeated the tests several times just to be sure, with the same result every time. Wow, turns out the bug was always there!
It’s just that the TTL CPU’s longer vulnerability window makes it more likely to occur. Man o’ man, I would not have guessed that! (Strangely enough, the problem does not seem to resurface during game play, which looks very much like the opening animation. This might also be only a matter of time, however.)
So, now, with this change, the behaviour of the TTL CPU in this respect should match the 6510 with only two exceptions:
- the “Lost NMI” window is 15ns after the vector low-byte fetch, rather than a 1/2 cycle before it; and
- a “Later NMI” will interrupt the first instruction of the ISR that follows it, rather than the second.
At this stage, I can only hope that both these incompatibilities will turn out to be benign ... (Yes, hope springs eternal.)
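To put the two "Lost NMI" windows side by side, here is the arithmetic as a quick sketch. The clock figure is an assumption: I am using a nominal 1 MHz, which is close to but not exactly the C64's CPU clock. The 15ns figure is the clear-pulse width quoted above.

```python
def half_cycle_ns(clock_hz):
    """Length of half a CPU cycle in nanoseconds."""
    return 0.5 * 1e9 / clock_hz


CLOCK_HZ = 1_000_000    # assumption: nominal 1 MHz CPU clock
CLEAR_PULSE_NS = 15.0   # width of the flag-clearing pulse

nmos_window_ns = half_cycle_ns(CLOCK_HZ)
ratio = nmos_window_ns / CLEAR_PULSE_NS

print(f"1/2 cycle window:   {nmos_window_ns:.0f} ns")
print(f"clear-pulse window: {CLEAR_PULSE_NS:.0f} ns")
print(f"ratio:              {ratio:.1f}x")
```

On those assumptions the remaining window in the TTL CPU is a few tens of times narrower than the half-cycle window reported for the NMOS part, though it sits at a different point in the sequence.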
Alright, so the next step is to look for yet another patch to the existing boards to try to fix this problem. In the meantime, I would appreciate a sanity test here. The specifics of 6502 interrupt handling are a bit tricky, to say the least, and it would not surprise me to learn that I’ve botched this up. Does the behavior described above seem correct? Is it the same on the 65C02? Many thanks in advance for any comments.
Cheers for now,
Drass