POC Computer Version One

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sat May 08, 2021 9:42 pm

GARTHWILSON wrote:

I kind of expect it will turn out to be something simple that nobody here (including myself) has thought of.

It seems you are correct. The problem is simple (sort of), but unexpected.

I accidentally discovered I can make the problem appear/disappear by applying pressure to the top of the SRAM. In one case, I managed to crash the machine that way, evident because the "heart-beat" indicator (an LED indirectly driven by the IRQ line) went completely dark—it is normally dim until some IRQ-intensive activity occurs, e.g., serial I/O.

So it would seem I have a mechanical problem, not an electrical or timing issue. With that revelation, I put on the high-powered magnifiers (8X) and carefully examined the SRAM, which is an SOJ32 package. As near as I can tell, all joints look pristine, with the soldering having nice radii where the tucked-under J-leads meet up with the PCB pads.

As the soldering looks to be okay, it could be there is some "debris" under the SRAM that is creating an intermittent short of some kind. I can't see in there, so it's only supposition. The other possibility is the SRAM has an internal data pin connection that is disturbed by pressure on the package.

Either way, it appears replacement of the SRAM will be required to ferret out the problem, a task that I can't carry out.

GARTHWILSON · Post by **GARTHWILSON** » Sun May 09, 2021 1:44 am

BigDumbDinosaur wrote:

As the soldering looks to be okay, it could be there is some "debris" under the SRAM that is creating an intermittent short of some kind. I can't see in there, so it's only supposition. The other possibility is the SRAM has an internal data pin connection that is disturbed by pressure on the package.

When we were selling something with a Rockwell 65c22, there was one that came back bad and it was pretty clearly a faulty wire bond inside from the lead frame to the die on one of the port bits. I've only seen that kind of thing once in all my years. It passed production testing, but whatever vibration and temperature cycling it experienced in the field made it break the connection.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun May 09, 2021 5:33 am

GARTHWILSON wrote:

When we were selling something with a Rockwell 65c22, there was one that came back bad and it was pretty clearly a faulty wire bond inside from the lead frame to the die on one of the port bits. I've only seen that kind of thing once in all my years. It passed production testing, but whatever vibration and temperature cycling it experienced in the field made it break the connection.

I remember in the distant past that applying heat for too long to device pins would occasionally cause a wire bond to let go. However, SMT devices are designed to be exposed to high temperatures during reflow (~450° F), so it would be expected that the bonds should be tolerant of considerable heating. Only way this will get solved is with replacement of the SRAM.

Dr Jefyll · Post by **Dr Jefyll** » Sun May 09, 2021 6:04 am

GARTHWILSON wrote:

I kind of expect it will turn out to be something simple that nobody here (including myself) has thought of.

Yes, I too have that feeling.

BigDumbDinosaur wrote:

Either way, it appears replacement of the SRAM will be required to ferret out the problem, a task that I can't carry out.

I wouldn't give up on that RAM yet -- especially since, as you say, replacing it won't be easy.

BigDumbDinosaur wrote:

BTW, an interesting thing came up when I tried to run V1.2 with V1.1's MPU. It wouldn't boot, even with Ø2 at 4 MHz. V1.1's MPU is an old part with a Sanyo 0.8µ core. It will run at 14.1 MHz in POC V1.1.

This is a rather conspicuous clue!

And I don't believe the "broken bond wire" or "debris under the SRAM" theory can account for it.

If you re-install the 0.8µ CPU and find yourself unable to reproduce the intermittent behavior (by pressing on the RAM) then that'd be highly significant. Worth trying again, I'd say. Failing that, the original observation is worth remembering.

-- Jeff

PS: I have a theory that's a fairly good fit for the memory problem and also explains the alternative-CPU observation. Anyone see where I'm going with this?

I believe the RAM is working within specification, BTW...

enso · Post by **enso** » Sun May 09, 2021 7:29 am

Dr. Jefyll, I am not sure where you are going, but I also don't think the problem is entirely mechanical. The smell is a little off.

BDD, how much pressure is needed to subvert the system? Is it possible that it's not pressure but rather the capacitance of your thumb that's tipping the circuit?

I still suspect that there is very tight timing involved, and that some kind of noise or signal-integrity issue that would otherwise pass unnoticed is causing the problem. But I've been wrong before, and look forward to being wrong many more times in the future.

It is interesting that an older and larger-geometry CPU fails... This implies slower rise/fall times, and therefore a slightly smaller window for some critical operation - perhaps the latching of the high address bits... Or some other issue with decoding logic...

GARTHWILSON · Post by **GARTHWILSON** » Sun May 09, 2021 7:43 am

BigDumbDinosaur wrote:

GARTHWILSON wrote:

When we were selling something with a Rockwell 65c22, there was one that came back bad and it was pretty clearly a faulty wire bond inside from the lead frame to the die on one of the port bits. I've only seen that kind of thing once in all my years. It passed production testing, but whatever vibration and temperature cycling it experienced in the field made it break the connection.

I remember in the distant past that applying heat for too long to device pins would occasionally cause a wire bond to let go. However, SMT devices are designed to be exposed to high temperatures during reflow (~450° F), so it would be expected that the bonds should be tolerant of considerable heating. Only way this will get solved is with replacement of the SRAM.

On the 65c22, the wire bond had to already be very marginal to fail. In mentioning temperature, I was not thinking of high soldering temperature, but the temperature cycling it could get in the field. I've seen MOSFETs actually operating at 350°C (660°F) or higher, and neither the silicon nor the bondwires failed (although it sure wouldn't last long at that temperature). I say "or higher" because I was not allowed to get calibration data on the IR microscope above that. It was definitely above it, but I don't know how much. This microscope let me measure the temperature of a .001" circle on the actual die, as I moved up and down the finger rows in the power transistors made by the company I was working for in the engineering lab.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Mon May 10, 2021 2:46 am

enso wrote:

BDD, how much pressure is needed to subvert the system? Is it possible that it's not pressure but rather the capacitance of your thumb that's tipping the circuit?

A fair amount, actually.

I accidentally discovered it when I picked up the unit while it was running, and had grasped it on the SRAM (I'm more-or-less left-handed and, like a lot of double bass players, have a strong left-hand grip, so I probably applied much more pressure to the unit than I thought). I had verified that I had discharged myself to avoid ESD, but when the unit went belly-up I thought I had zapped it. Resetting resumed operation. That was when I made a determined effort to make the problem appear again, which led to repeatedly applying pressure to the SRAM. I am relatively confident that any capacitance my hand might have added is not the culprit.

The problem is hard to trigger, so scoping isn't really much of a help.

Quote:

I still suspect that there is very tight timing involved, and that some kind of noise or signal-integrity issue that would otherwise pass unnoticed is causing the problem. But I've been wrong before, and look forward to being wrong many more times in the future.

The SRAM itself is a 12ns part and there are only two gate delays between it and the address bus. Plus I was able to provoke the problem with Ø2 slowed down to 12.5 MHz. As everything is 74AC logic, mostly with single-digit prop times, I'm not seeing timing as the culprit. Also, when I got the unit running and clocked at 20 MHz, I had examined it with the logic analyzer (which has 2ns resolution) and could see that I had decent timing headroom.

[Image: POC RAM Read Timing @ 20 MHz]
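
As a rough cross-check of the headroom argument, the arithmetic can be sketched in Python. This is only a back-of-envelope model, not a substitute for the logic analyzer capture. It assumes a 50% Ø2 duty cycle, pessimistically requires the whole read path to fit inside the Ø2-high half cycle, uses an assumed 5 ns per gate (the post only says "single-digit"), and takes a placeholder MPU setup time of 0 ns that should be replaced with the data sheet figure. The function and parameter names are invented for illustration.

```python
def half_cycle_ns(f_mhz):
    """Length of one O2 half cycle in ns, assuming a 50% duty cycle."""
    return 1000.0 / f_mhz / 2.0


def read_slack_ns(f_mhz, t_sram_ns=12.0, n_gates=2, t_gate_ns=5.0, t_setup_ns=0.0):
    """Crude read-timing slack in ns.

    t_sram_ns and n_gates come from the post (12 ns SRAM, two gate delays).
    t_gate_ns and t_setup_ns are ASSUMED placeholder values.
    """
    path_ns = t_sram_ns + n_gates * t_gate_ns + t_setup_ns
    return half_cycle_ns(f_mhz) - path_ns


if __name__ == "__main__":
    # The three clock rates mentioned in the thread.
    for f in (12.5, 14.0, 20.0):
        print(f"{f:5.1f} MHz: half cycle {half_cycle_ns(f):5.1f} ns, "
              f"slack {read_slack_ns(f):5.1f} ns")
```

Under these assumptions the slack grows quickly as Ø2 is slowed, so a failure that can still be provoked at 12.5 MHz is hard to blame on SRAM access time alone.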

Quote:

It is interesting that an older and larger-geometry CPU fails... This implies slower rise/fall times, and therefore a slightly smaller window for some critical operation - perhaps the latching of the high address bits... Or some other issue with decoding logic...

Also, that older MPU has a Sanyo die, not TSMC. In the history of the 65C816, it was switched from the 0.8µ geometry to 0.6µ around 2004, but still with Sanyo dice. Then a rather sudden change was made from Sanyo to TSMC and very few Sanyo dice found their way into production parts. At the time when I had queried WDC about the change in connection with something else, I got the vague impression the Sanyo dice were marginal in performance in some way, which is why a different foundry was chosen. Of course, cost could have also been a consideration.

Something of note is that around the time the switch to the 0.6µ geometry was made, there was increased interest in 3.3 volt operation in embedded systems. In passing, it was also around that time that SRAM producers such as Cypress and ISSI reduced output levels from CMOS to TTL. I'm sure WDC had to be aware of the change (a lot of units using WDC MPUs would also be using SRAM) and what it would mean to system designers mating a device with TTL outputs to a device with CMOS inputs. That would support my theory that, despite what the data sheet says, the 816's inputs are TTL-compatible.

Dr Jefyll wrote:

PS: I have a theory that's a fairly good fit for the memory problem and also explains the alternative-CPU observation. Anyone see where I'm going with this?

I believe the RAM is working within specification, BTW...

I'm not sure—I want to hear your theory to see if we're on the same wavelength.

enso · Post by **enso** » Mon May 10, 2021 4:15 am

.....DRUM ROLL....

Dr Jefyll · Post by **Dr Jefyll** » Mon May 10, 2021 5:32 am

Thank you, Enso.

BDD, I'm not fully convinced I know what POC's problem is, but there's a related subject regarding which you and I have repeatedly butted heads, and I've found you remarkably stubb.. uh, I mean resolute!

The latest skirmish doesn't seem to have convinced you, so I'm giving it one more shot, here:

TTL Compatible... NOT! ( WDC ).

Related to this, I do have a simple experiment you can try on POC, but I hesitated to mention it. I will do so now.

Put some pullup resistors on the data bus and see what happens!

More likely than not, what's been going on is this. Purely as a matter of luck, you happened to get a RAM whose outputs don't exceed the TTL VOH spec by much. That'd leave you with no noise immunity, as noted in the "skirmish" link above. In other words, that particular RAM chip can barely say "1" in a way that that particular '816 can hear. The transition point of the 816's data bus input is only marginally within the RAM's reach.

And... purely as a matter of luck, the input transition point of the other '816 happens to be a few millivolts higher than that of the original '816. That would explain why a RAM "problem" is affected by swapping the CPU.

As for getting an effect from pressing on the RAM, who knows! If you've got a system with zero noise immunity, heck -- all you have to do is look at it sideways. Enso's comment about capacitance is as apt as any.

If pullups fix POC's problem, then you owe me a LOT of beer if we ever get a chance to sit down together and drink it!

-- Jeff

BigDumbDinosaur · Post by **BigDumbDinosaur** » Mon May 10, 2021 7:30 am

Dr Jefyll wrote:

Related to this, I do have a simple experiment you can try on POC, but I hesitated to mention it. I will do so now.

Put some pullup resistors on the data bus and see what happens!

I don't have any suitable SIP resistors, so I'll have to monkey-rig something with regular quarter-watt parts. Might be able to try it out in a day or two.

Quote:

As for getting an effect from pressing on the RAM, who knows! If you've got a system with zero noise immunity, heck -- all you have to do is look at it sideways. Enso's comment about capacitance is as apt as any.

I would agree except I can ground my hand with my static strap and run my fingers all over the SRAM's pins without upsetting the apple cart. I can remove the static strap and still can't get the machine to act up by touching things. It takes some mechanical stress on the SRAM to trigger the problem, and I'm not even sure that that is the actual culprit. It could be some other mechanical thing is being perturbed.

I can report that V1.2 has been running without a hitch for close to two days, and I've been using it to test and debug my updated string library. It was while testing a library function that I became aware of the problem, yet so far it's not come back to bite me.

Quote:

If pullups fix POC's problem, then you owe me a LOT of beer if we ever get a chance to sit down together and drink it!

Okay, you're on! I assume you go for fine craft beers instead of that mass-produced, kidney-killing stuff.

And if pullups make no change you're on the hook for a juicy steak cut from prime Canadian beef.

More seriously, if it does turn out that marginal drive from the SRAM is behind this problem it would be a strong argument in favor of using a data bus transceiver to get signals up to Vcc when RAM, ROM or I/O is talking. Of course, a transceiver inserts some prop delay into the data path, but 74ACT and 74AHCT parts are down in the single digits. I'd have to design a new unit to go that route.

enso · Post by **enso** » Mon May 10, 2021 12:45 pm

Well, that blows.

Figures that this CPU, which for some reason I have a very soft spot for, is a terrible beast! I've been watching the 'Not compatible' discussion as a theoretical topic, since the CPUs 'work' with everything... but this makes it real. It's the marginal smell I've been experiencing.

Jeff, have you done any measurements at 3.3V? I suppose it won't be quite right at 3.3V either.

One way to deal with this is to characterize SRAMs, EPROMs/flash and peripherals for WDC voltage compatibility ourselves... That still does not guarantee much from batch to batch, but since everything is almost compatible, some brands are probably more almost-compatible than others.

BDD, what SRAMs are you using? After you confirm this, we should probably start a compatibility thread. Can't wait for the results.

plasmo · Post by **plasmo** » Mon May 10, 2021 2:31 pm

If marginal logic high voltage is the problem, then high value pull-down resistors (10-20K) should bring out the failures consistently.
Bill

BigDumbDinosaur · Post by **BigDumbDinosaur** » Mon May 10, 2021 8:55 pm

enso wrote:

BDD, what SRAMs are you using? After you confirm this, we should probably start a compatibility thread. Can't wait for the results.

All of my POC V1 units have used the ISSI IS61C1024AL, which is a 128K, 12ns unit with TTL-compatible inputs and outputs. VOH is specified as 2.4 volts minimum. That number, of course, assumes maximum loading. If everything in the circuit is CMOS, I'd expect the output voltage to be higher, although the theoretical limit for a device with totem-pole outputs is around 3.5 volts, which is the ragged edge of being a CMOS logic 1 in a 5 volt system.

I'm hoping to soon have some information from WDC that should settle the nature of the 65C02's and 65C816's inputs. Based on some testing Jeff says he's done, it appears those inputs are likely 50/50 with Schmitt triggering. If that is the case, and assuming use of a PCB design that keeps parasitic capacitance to a minimum, the C02/816 should reliably respond to the output of the SRAM. My POC V1.1 unit, operating at 14 MHz, has been 100 percent rock-solid using that SRAM, with no provisions for trying to boost the output, e.g., with use of pullups.
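
The disagreement in this thread comes down to the high-level noise margin, VOH minus the input threshold. A minimal Python sketch of that arithmetic follows. The 2.4 V VOH minimum is the SRAM figure quoted above. The three candidate thresholds are assumptions for comparison only: 2.0 V is the conventional TTL VIH, 0.7 × Vcc is a common CMOS VIH rule of thumb (3.5 V at 5 V, matching the figure mentioned above), and 0.5 × Vcc stands in for the "50/50" guess. Which of them actually describes the 65C816's inputs is exactly what is in dispute, and the names below are invented for illustration.

```python
VCC = 5.0       # supply voltage, volts
VOH_MIN = 2.4   # SRAM guaranteed minimum output-high voltage, as quoted in the post

# Candidate input-high thresholds (ASSUMPTIONS, for comparison only).
THRESHOLDS = {
    "TTL VIH (2.0 V)": 2.0,
    "50/50 (0.5 * Vcc)": 0.5 * VCC,
    "CMOS VIH (0.7 * Vcc)": 0.7 * VCC,
}


def high_noise_margin(voh, vih):
    """Guaranteed high-level noise margin in volts; negative means none."""
    return voh - vih


if __name__ == "__main__":
    for name, vih in THRESHOLDS.items():
        margin = high_noise_margin(VOH_MIN, vih)
        print(f"{name:22s} margin = {margin:+.2f} V")
```

Only the TTL threshold leaves a positive guaranteed margin. With either of the other two, correct operation depends on the SRAM's actual output level exceeding its guaranteed minimum, which it may well do under light CMOS loading.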

BigDumbDinosaur · Post by **BigDumbDinosaur** » Thu May 13, 2021 5:46 am

I have not tried the pullup experiment, as I am reasonably certain at this time that the intermittent failure of POC V1.2 when I put pressure on the SRAM is due to a mechanical issue. V1.2 has been running steadily at 20 MHz for 84.7 hours as of this writing, and I have been using it for software testing. If a marginal input to the 65C816 was behind the failures I would have expected one to occur by now. Everything is working fine.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed May 19, 2021 9:36 pm

It's a depressing time at the dinosaur pen. The sky has gone grey, there are cold sheets of rain and sadness is the order of the day.

POC V1.2 has gone dead. Or, as a Russian might say, "No bootsky." Now, where's my logic probe?

Not sure at this point what happened, but I did mention a few posts back that the unit crashed when I randomly applied some pressure in the vicinity of the SRAM. So I started by examining that area of the board with my 8X magnifier, looking for a possible cause. The solder joints on the SRAM's pins look pristine and I didn't see any foreign matter that might have caused an intermittent short. The examination continued with me checking every soldered connection on the board, especially where I had bodge-wired it to add wait-stating. Nothing stood out.

The thought has occurred to me that the PCB might have a mechanical defect, such as a via that isn't maintaining continuity through the board. In years past, that sort of defect occasionally appeared, but I haven't encountered it since at least the latter 1970s. The board house that produced this PCB is a high-quality source that I have used many times before without incident. So I'm not seeing a board defect as the cause. In any case, the only way I'm going to eliminate a bad via is by checking continuity the old-fashioned way, which for me would be a very laborious task due to the small sizes involved: almost all vias are .026" (.66mm) in diameter with a .008" (.20mm) hole.

I did examine the area for possible trace fractures, but again nothing stood out. While I can't completely rule out a PCB defect, I really don't think it's likely.

My other suspect is the SRAM itself. I can probably get some insight on that by using my Ø2 single-stepper to cycle through the early stage of the reset handler, which looks like this:

Code:

hrst     sei                   ;no IRQs during 1st stage POST
         sec                   ;select 'C02 mode to...
         xce                   ;default MPU state
         clc
         xce                   ;select '816 mode
;
;
;	destructively test critical RAM areas...
;
         lda #testpata         ;1st test pattern
         ldx #0
;
.0000010 sta mm_ram,x          ;write physical direct page
         sta absolute,x        ;write kernel page
         sta hwstackb,x        ;write stack
         inx
         bne .0000010          ;next

Watching the SRAM's /CS and /OE pins with the logic probe should at least tell me if RAM is being selected and written to, which, if so, would indirectly tell me that ROM is being selected and read.

Later in the reset handler:

Code:

         txy                   ;$00 —> .Y
;
.0000020 inx                   ;waste time to see if...
         bne .0000020          ;RAM holds
;
         iny
         bne .0000020
;
.0000030 cmp mm_ram,x          ;check for proper pattern
         beq .0000050          ;we're good so far
;
.0000040 stp                   ;fatal, halt MPU...
;
;	————————————————————————————————————————————————————————————————
;	In the event testing fails the only thing that can be done is to
;	halt the MPU.   It isn't possible to report anything to the user
;	because I/O isn't operable this early in the testing.
;	————————————————————————————————————————————————————————————————
;
.0000050 cmp absolute,x        ;check kernel page
         bne .0000040          ;error
;
         cmp hwstackb,x        ;check stack
         bne .0000040          ;error

If any one of the above RAM checks fails the MPU will be halted, which condition could be detected by examining the MPU's VPA signal. If the MPU is running and executing instructions, I would expect to see VPA continuously cycling, since VPA goes high when the MPU fetches an opcode or an operand, and goes low during a data fetch. If the STP instruction gets executed due to a RAM error, all activity would cease and I would expect to see VPA continuously low.
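
To make the pass/fail behavior of that check concrete, here is a small Python model of it. This is an illustrative sketch only, not the firmware: the page addresses and the 0xA5 pattern are hypothetical stand-ins for `mm_ram`, `absolute`, `hwstackb` and `testpata`, whose real values are not given in the listing, and the `Ram` class with its stuck-low data bit is an invented fault model.

```python
PAGE = 256
# HYPOTHETICAL base addresses standing in for mm_ram, absolute and hwstackb.
MM_RAM, KERNEL, HWSTACK = 0x0000, 0x0200, 0x0100


class Ram:
    """Toy RAM; stuck_low_mask marks data bits that always read back as 0."""

    def __init__(self, size=0x10000, stuck_low_mask=0x00):
        self.cells = bytearray(size)
        self.stuck_low_mask = stuck_low_mask

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        return self.cells[addr] & ~self.stuck_low_mask & 0xFF


def post_ram_check(ram, pattern=0xA5):
    """Fill three pages with a pattern, then verify them.

    Returns "STP" on the first mismatch (the MPU would halt) or "OK".
    The 0xA5 pattern is an ASSUMED value for testpata.
    """
    bases = (MM_RAM, KERNEL, HWSTACK)
    for x in range(PAGE):              # write loop, like the sta ...,x sequence
        for base in bases:
            ram.write(base + x, pattern)
    for x in range(PAGE):              # verify loop, like the cmp ...,x sequence
        for base in bases:
            if ram.read(base + x) != pattern:
                return "STP"
    return "OK"


if __name__ == "__main__":
    print(post_ram_check(Ram()))                     # healthy RAM
    print(post_ram_check(Ram(stuck_low_mask=0x01)))  # bit 0 stuck low
    print(post_ram_check(Ram(stuck_low_mask=0x02)))  # bit 1 stuck low
```

In this model a stuck-low bit is caught only if the pattern has that bit set, which is one reason a test of this kind typically uses more than one pattern. The "1st test pattern" comment in the listing suggests the real handler does the same.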

As I said, where's my logic probe?
