Dr Jefyll wrote:
...Before a failure the system will...
- fetch and execute a BRA (repeats zillions of times, until... )
- IRQ goes low and firmware decides which interrupt(s) it thinks need service
- the corresponding service routine(s) are executed
- an RTI occurs, which ought to take us back to the BRA loop (unless the stack's corrupted)
Only one link in the chain needs to break. Might there be, or appear to be, a spurious interrupt (ie, not the timer/jiffy) from the UART?
The oscilloscope says there are no spurious IRQs. After the POST screen has been displayed and the firmware goes into the spin loop, the only activity I see on /IRQ is a series of very short duration negative pulses spaced exactly 10ms apart. As the jiffy IRQ rate is 100 Hz, that pulse spacing is correct.
Quote:
If the code that determines the interrupt source had a bug, what would happen if an inappropriate routine were called? Are there calls to RAM for any of that stuff?
The entire IRQ handler (ISR) is in ROM but, of course, it accesses both direct page and stack space. There is no code in RAM.
At present, the ISR only knows about the three programmed QUART interrupts: channel A receiver full, channel A transmitter empty, and timer A underflow. Were the QUART to generate a spurious interrupt, the MPU would repeatedly execute the ISR in a futile effort to find and service the IRQ. That condition would result in /IRQ being continuously low, with no pulse activity, and RDY mostly low, but with pulse activity. Since IRQs are not unmasked until SR is pulled by the RTI instruction, the ISR cannot be re-entered while it is running. Each pass therefore pulls everything it pushed, and repeated attempts to service a spurious IRQ will not overflow the stack.
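To illustrate why a spurious IRQ would spin the ISR without eating stack, here is a rough sketch of a polled handler of the kind described above. It is not the actual firmware: q_isr, the m_ masks and the svc_ routines are hypothetical names, and it assumes native mode, ROM and the QUART both in bank $00, and service routines that return with RTS and leave the stack balanced.

```
;sketch only: q_isr, m_cta, m_rxa, m_txa & the svc_ routines are
;hypothetical names, not the real firmware symbols
;
iirq     rep #%00110000        ;16-bit accumulator & index
         pha                   ;save MPU state
         phx
         phy
         phb                   ;save data bank
         phk
         plb                   ;data bank = program bank (assumed $00)
         sep #%00100000        ;8-bit accumulator
;
         lda q_isr             ;timer A underflow?
         and #m_cta
         beq l0010             ;no
;
         jsr svc_cta           ;service jiffy IRQ
;
l0010    lda q_isr             ;channel A receiver full?
         and #m_rxa
         beq l0020             ;no
;
         jsr svc_rxa
;
l0020    lda q_isr             ;channel A transmitter empty?
         and #m_txa
         beq l0030             ;no
;
         jsr svc_txa
;
l0030    rep #%00110000        ;16-bit registers
         plb                   ;restore MPU state
         ply
         plx
         pla
         rti                   ;pulls SR, which unmasks IRQs
```

If none of the three tests finds a source, execution falls through to l0030 with the stack exactly where it was on entry. The RTI then unmasks IRQs, /IRQ is still low, and the MPU goes straight back to iirq.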
Both direct page and the stack are located in a block of RAM that extends from $00D800 to $00DEFF, with the stack pointer initialized to $DEFF during the early stages of POST. The memory map for that range is as follows:
Code:
FIRMWARE WORKSPACE DEFINITIONS
;
000000 kerneldb =$00 ;default data bank
00D800 kerneldp =d8ram ;virtual direct page
00D900 workspac =kerneldp+s_rampag ;start of vectors & tables
00DA00 siocfifo =workspac+s_rampag ;start of SIO CFIFOs
00DEFF hwstack =hmubas-1 ;top of MPU stack
The space allocated to serial I/O (SIO) queues (CFIFOs) extends from $00DA00 to $00DDFF inclusive. That leaves $00DE00 to $00DEFF for the stack, which is 256 bytes, more than enough to handle anything the firmware would require.
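Spelled out as assembler equates, the stack-size arithmetic looks something like the following. This is a worked example only: e_siofifo, stackbot and s_stack are hypothetical names that do not appear in the firmware listing, and the values of s_rampag and hmubas are inferred from the addresses shown above rather than copied from the source.

```
;worked arithmetic only: e_siofifo, stackbot & s_stack are hypothetical
;
d8ram    =$00d800              ;kerneldp in the listing
s_rampag =$0100                ;implied by workspac-kerneldp
hmubas   =$00df00              ;implied by hwstack+1
e_siofifo =$00ddff             ;last CFIFO byte, per the text
;
stackbot =e_siofifo+1          ;$00DE00, lowest stack address
hwstack  =hmubas-1             ;$00DEFF, top of MPU stack
s_stack  =hwstack-stackbot+1   ;$0100 = 256 bytes
```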
Quote:
The reason I ask about RAM is because that's where PC apparently points after a failure (as shown by RDY going and staying high).
Same thing I'm thinking. An inadvertently executed SToP instruction would produce the conditions I described.
LIV2 wrote:
If it was hitting a WAI/STP instruction it'd pull RDY low, wouldn't it? BDD's observation is that it stops with RDY high.
According to the 65C816 data sheet, executing STP has no effect on RDY. Therefore, if accidentally executing STP is in fact what is crashing the machine, /IRQ being continuously low (the jiffy IRQ would never be serviced) and RDY being continuously high following the crash would make sense.
GaBuZoMeu wrote:
To get a more detailed picture of what "fatality" might be, you could use the NMI (manually triggered once the MPU is in the weeds) to get a dump of the current state (registers, flags, top of stack +/- 10 entries). As your serial ports might not work properly at that moment I would dump these information into some non used RAM locations. There they can be verified after RESET.
Unfortunately, the unit is not yet to the point where something like that is practical.
Quote:
Does your system issue some sort of message when there is a "spurious" IRQ?
No.
Quote:
Do you have code (and messages) provided that responds to IRQs, BRKs, NMIs in the case of being in emulation mode? What would happen then? (I assume the POC is usually running in native mode.)
At reset, the 65C816 is put into native mode and it stays there. If the MPU somehow gets switched back to emulation mode and an interrupt hits, the machine will crash. There is no support for emulation mode in the firmware (also true of POC V1.1).
Quote:
You may expand your idle loop with two verifications: a) is the SP pointing to the regular place? b) is the state of the flags (especially M,X, and I) what it should be?
Checking and/or correcting SP would be fairly easy. I can't easily observe the others, but once the MPU has entered the HERE: BRA HERE loop, m and x wouldn't matter.
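For what it's worth, an idle loop that checks and corrects SP might look roughly like this. It is a sketch, not code from the firmware, and it assumes the MPU is in native mode, the foreground stack is empty whenever the loop runs (so SP should equal hwstack), and the ISR preserves the accumulator. The assembler also has to be told that the immediate operands are 16 bits, and the syntax for that varies from one assembler to the next.

```
;sketch only: assumes native mode & an empty foreground stack at idle
;
here     rep #%00100000        ;16-bit accumulator
         tsc                   ;SP --> .C
         cmp #hwstack          ;16-bit compare against $DEFF
         beq here              ;SP is where it belongs, keep idling
;
         lda #hwstack          ;SP has wandered...
         tcs                   ;put it back
         bra here
```

Silently correcting SP would hide the evidence, so while debugging it might be more useful to branch to something that makes the fault visible instead.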
Dr Jefyll wrote:
Yes. My main focus was on pointing out that a crashed CPU usually keeps executing something.
Just to amplify Jeff's comment, in the case of the 65C816 all opcodes are documented instructions. Therefore the '816 really can't be killed by executing any opcode, other than STP and WAI.
Quote:
It's true I didn't thoroughly discuss the two exceptions, STP and WAI. (It occurred to me to wonder about BDD's CPLD logic, which sometimes actively pulls RDY high. But if a bug in that department resulted in a tug-of-war on RDY then I think there'd be some noticeably hot chips, or a loss of +5.)
The CPLD is the biggest power-consumer in the unit and is slightly warm to the touch after a period of time. All the other major chips are barely warm to the touch. While that isn't a scientific test to determine if something is using too much juice, it's reasonable. Power at the input jack on the unit is 5.02 volts, which is what I have been seeing all along.
Now, here's where it gets curiouser and curiouser. I powered up the unit at 0638 ZULU time last night (May 24), right after I had reprogrammed the CPLD with my tidied-up code. As I write this, it is 0611 ZULU time on May 25. The unit has been running continuously since that startup, nearly 24 hours, and is still alive. The probe on /IRQ shows normal activity, as does the probe on RDY. In other words, everything appears to be copacetic right now. I'm wondering if reprogramming the CPLD accidentally fixed the damned thing.