Single stepping with RDY, re-implementing Woz's circuit

akohlbecker · Post by **akohlbecker** » Mon Jul 24, 2023 1:55 pm

fachat wrote:

I am not sure I understand your schematics. Why is the output of the '541 always connected to the data bus?

But anyway, what tool are you using to draw these timing diagrams?

It is a bus holding device, it ensures the bus is always pulled up / down to the last valid value. So when the VIA stops outputting at the end of the first cycle, the bus holder keeps the value in place until another device takes over.

I mentioned it above, I'm using wavedrom, see below

akohlbecker wrote:

The rendering engine for the diagram is Wavedrom https://wavedrom.com/, with a few patches by me (most importantly a narrower theme for more horizontal space) as seen on my GitHub here https://github.com/wavedrom/wavedrom/co ... drom:trunk

Now, this tool generates cool looking diagrams but its syntax is obnoxiously arcane. I have a library I haven't published yet that helps me with writing it in a sane way.

Proxy · Post by **Proxy** » Mon Jul 24, 2023 1:57 pm

EDIT: oops didn't see the second post already answering

akohlbecker · Post by **akohlbecker** » Mon Jul 24, 2023 2:06 pm

For giggles, here is the source for the previous diagram I posted, generated by my library (I don't think anyone should try writing this manually).

I have manually written a few diagrams, which you can see on my BB816 repo's README here https://github.com/adrienkohlbecker/BB8 ... g-diagrams (their source is available by expanding the arrow above each image, but they'll only work in my fork of the tool)

The main issues of this tool are its use of capital letters to mark legends (of which you run out really fast in the western alphabet leading to the use of weird characters), and the fact that you need to repeat "." for every repeated horizontal data in your signal (meaning to make a 10ns low signal, you need to write "0" then manually repeat "." nine times).

Both can, however, be automated, which my library is doing. I intend to publish it some time in the future.
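As a rough illustration of the second point (repeating "."), here is a minimal Python sketch of run-length expansion into a WaveDrom wave string. It is not the unpublished library mentioned above; the helper names `expand_wave` and `signal` and the example lanes are made up for this example.

```python
import json


def expand_wave(segments):
    """Turn [(state, length), ...] into a WaveDrom wave string.

    Each segment emits its state character once, then one '.' for every
    further time slot in which the signal keeps that state.
    """
    parts = []
    for state, length in segments:
        if length < 1:
            raise ValueError("segment length must be at least 1")
        parts.append(state + "." * (length - 1))
    return "".join(parts)


def signal(name, segments):
    """Build one WaveDrom signal lane as a plain dict."""
    return {"name": name, "wave": expand_wave(segments)}


# Hypothetical lanes: PHI2-like levels and RDY held low for ten slots.
diagram = {
    "signal": [
        signal("PHI2", [("1", 5), ("0", 5), ("1", 5)]),
        signal("RDY", [("1", 2), ("0", 10), ("1", 3)]),
    ]
}

# The resulting JSON can be pasted into the WaveDrom editor.
print(json.dumps(diagram, indent=2))
```

With this, a low signal lasting ten slots is written as `("0", 10)` instead of "0" followed by nine hand-typed dots.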

akohlbecker · Post by **akohlbecker** » Mon Jul 24, 2023 3:16 pm

Thinking about how to request an additional RDY cycle for writes led me to consider how to merge multiple RDY sources. The single stepping circuit shown in this thread needs to be updated, because at the moment it takes its feedback input directly from its output.

However, if there are other sources holding RDY low, then they could prevent the single stepped cycle from happening. The circuit should instead wait for all other sources to release RDY, so a valid clock cycle can happen, before activating again.

So, we need to update the feedback from U2A to U3B to be the combined /RDY signal, and find a way to AND together multiple RDY sources. Here I've used a '151 to do that.

In instruction stepping mode, the use of SYNC already ensures it reaches a new instruction, so this concern doesn't apply.

gfoot · Post by **gfoot** » Tue Jul 25, 2023 8:55 am

Some random thoughts that may or may not be useful:

Be careful with the OR gate that's generating IOCLK - the state of RDY could change during the clock-low period, causing unwanted edges on IOCLK in ways that don't correspond to what the CPU does (which is to wait for a falling edge on its clock with RDY asserted). Ideally I think you want something more like a D flipflop that gets set when the main clock is high, is clocked by the falling edge of the main clock, and on these edges sets itself to the opposite of RDY. That might not work though as its "set" input would have been active not long before the edge on its "clock" input, so it makes me nervous. Your own circuit (I believe) generates RDY in sync with the clock so won't suffer from this, but other RDY sources may not do that.

The margins for this (how long before PHI2 falls does RDY need to be stable, what does the CPU do if it's not?) may also differ for your circuit compared to the CPU itself, and it begs the question, why not just feed this stretched clock signal to the CPU directly and not use RDY, so that it's 100% clear at any point whether the CPU is paused or not? Your clock-stretch circuit is then a single point of truth on the matter, rather than having to partially replicate what the CPU is doing external to the CPU itself. I'm not sure whether this is why BigDumbDinosaur advocates clock stretching over use of RDY, but it does feel quite compelling.

Regarding combining RDY from multiple sources, if you use diodes to connect them to the CPU's RDY pin with a pull-up resistor then that will naturally wire-OR the RDY signals, so you might be able to use the result directly instead of needing your own gate for it.

It feels to me like single-cycle stepping wants to pull RDY low after one successfully-executed cycle - as above, for asynchronous RDY signals from other sources this could be hard to decide as other sources may cause a cycle to not go ahead, and as above it's one case where at least with a stretched clock you do know for sure whether a cycle was executed. With RDY, I guess you're looking for a falling clock edge with RDY high, and hoping your circuit's judgement on this matches the CPU's. Single-instruction stepping wants to do the same thing but only stop again on the next cycle where SYNC is high - so it needs to look for that falling clock edge with RDY floating high, and then check for SYNC on the next rising clock edge, and pull RDY low if it finds it.

Both these cases maybe need to be thought of backwards - in some sense, when debugging is enabled, the natural state is to pause, and these are two different ways to temporarily "override" that paused state. This may be going way off into the weeds, but it feels like a single-cycle step means "unpause (but RDY may still be pulled low by something else); wait for a falling clock edge with RDY floating high; then make a note to pause again on the next rising clock edge" - and the only difference for single-instruction stepping is in that last part where we only want to pause again on a rising edge if SYNC is asserted. What other systems do with RDY in the meantime doesn't matter at all, so long as we observe at least one end-of-PHI2-with-RDY-high event - other systems may have their reasons for delaying later cycles during an instruction for example, but we don't care so long as at least one cycle gets executed and the instruction eventually gets completed - our only interest is making sure that the debug pause circuit doesn't hold RDY low again until the time comes.

For what it's worth, I believe what the Woz and MOS circuits were doing was mostly having a long-term state (physical switch!) that controls whether we debug-pause every cycle or only when SYNC is high, and another mechanism to briefly unpause - which indeed makes sense, it's just important when asked to step, that you do wait for an actual RDY-high falling clock edge.
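The "unpause, wait for a falling edge with RDY high, then pause again on a rising edge" idea can be sketched as a small behavioral model in Python. This is only an illustration of the sequencing described above, not a gate-level design and not the circuit from this thread; the class `DebugPause`, the helper `run`, and the per-cycle `(other_rdy, sync)` tuples are assumptions made for the example.

```python
class DebugPause:
    """Behavioral model of a debug pause source on a shared RDY line."""

    def __init__(self, instruction_mode=False):
        self.instruction_mode = instruction_mode
        self.holding = True       # True: this source pulls RDY low
        self.cycle_done = False   # saw a falling edge with RDY high

    def step(self):
        """User asks for a step: release RDY and start watching."""
        self.holding = False
        self.cycle_done = False

    def falling_edge(self, rdy):
        """rdy is the combined RDY level at the end of PHI2."""
        if not self.holding and rdy:
            self.cycle_done = True

    def rising_edge(self, sync):
        """Pause again once a cycle ran (and SYNC is high in instruction mode)."""
        if self.cycle_done and (sync or not self.instruction_mode):
            self.holding = True
            self.cycle_done = False


def run(dbg, cycles):
    """cycles: list of (other_rdy, sync). Returns how many cycles executed."""
    executed = 0
    for other_rdy, sync in cycles:
        dbg.rising_edge(sync)
        rdy = (not dbg.holding) and other_rdy   # any source low wins
        if rdy:
            executed += 1
        dbg.falling_edge(rdy)
    return executed


# Cycle stepping: another source holds RDY low during the first cycle.
cycle_dbg = DebugPause()
cycle_dbg.step()
cycle_count = run(cycle_dbg, [(False, False), (True, False), (True, False)])

# Instruction stepping: 3-cycle instruction with one foreign wait state.
instr_dbg = DebugPause(instruction_mode=True)
instr_dbg.step()
instr_count = run(instr_dbg, [(True, True), (True, False), (False, False),
                              (True, False), (True, True)])
print(cycle_count, instr_count)
```

In the first run the step is not consumed by the cycle that another source blocked; in the second, the model stays released through the foreign wait state and only pauses again when SYNC is seen.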

Sorry that's probably a bit rambly, but I think it is a complex topic, and not one I've thought about a lot, these are just the things that came to mind when you mentioned other parts of the circuit also wanting to use RDY.

plasmo · Post by **plasmo** » Tue Jul 25, 2023 7:53 pm

RDY and clock stretching are not mutually exclusive. You can use both which may simplify your logic because single-source circuit is generally simpler than circuit that accommodates multiple sources.
Bill

akohlbecker · Post by **akohlbecker** » Thu Jul 27, 2023 1:36 pm

Thought-provoking comments, thanks. How to synchronize multiple sources and handle edge cases like the ones you mention, George, is exactly the kind of feedback/review I need.

I guess I need to specify my use case better, then I can see whether I need more safety or not.

At a high level, I would like a single stepping circuit that can pause any cycle indefinitely, controlled by an external system (debugger). I would still like to have the capability to stop the CPU for other reasons, which could be used for I/O wait states or bus mastering (eg. Ben Eater's VGA). Ideally, I would expose one such control pin to the external bus, and would not impose restrictions on its use, so I remain flexible in what type of circuits I build next. There could be multiple sources to merge if I decide to go for a backplane with separate cards, but it is not a hard requirement. Pausing the CPU (either for debugging, or wait states / DMA) should preserve the VIA counters. In particular, Ben's VGA card can pause the CPU at any time, including VIA cycles, so it needs a lot of flexibility to do that. Regular clock stretching for wait states is fine since I won't stretch the VIA cycles, but the VGA bus master use case is harder.

One additional requirement for the single stepping circuit is to be able to gate interrupts, especially when manually stepping. But that's outside the scope of this thread.

Thinking outside the box, if I could find an interface IC to replace the VIA and not deal with its reliance on the CPU clock, that would simplify my project greatly. I thought about a PIA (65C21) but its output ports are a bit weird, and I would still need timers. I know the SCL UART has timers, but it is not working well as a UART for me. There may be some argument to researching this direction more. I wonder if I could create a device similar to the VIA inside a CPLD, at least with basic functionality, which might be nice to couple with 65SPI for even more I/O awesomeness.

But, it might also be nice to solve the VIA stepping problem for once, which I find an interesting challenge, so I'm not sure I want to replace it.

So, all in all, on my mainboard I want to have: the debugger and single stepping logic, one VIA or other interface chip, and one or more "RDY/clock stretching" inputs for wait states / bus mastering of peripheral cards. Maybe this is becoming outside the scope of this thread, if the solution ends up not being "generic".

gfoot · Post by **gfoot** » Thu Jul 27, 2023 4:52 pm

I don't remember how careful Ben was with the bus mastering but I remember feeling uneasy about it when he did it, it is very heavy-handed. I think he had two clock domains - one 10MHz, one 1MHz - and if the 10MHz one wanted the bus then the CPU had to release it straight away. I don't think that's good: as RDY is only sampled at the falling edge of the clock, it feels possible that if BE and RDY both go low at the wrong time (I think it's within 10ns ahead of PHI2 falling) the CPU might not pause for another 1us, while the video circuit immediately drives the address and data buses and starts accessing the main system RAM at 10MHz. If this happened during a read cycle it could violate the CPU's read data hold time and maybe the CPU would read bad data.

It also seems likely that VIA cycles would get replayed repeatedly during the non-blank portion of the scanline! Which you are planning to solve, at least!

I'd be more comfortable if the video circuit said at least 1us in advance that it wanted the CPU to stop, so that the CPU can cleanly pause at the end of a cycle before BE goes low. But that time depends on the CPU clock speed. I may also be over cautious here, I've never used BE apart from during resets for RAM initialisation and I've usually had my video RAM separate from system memory.

akohlbecker · Post by **akohlbecker** » Thu Jul 27, 2023 5:45 pm

Oh yeah, it definitely is a really poor design (he did name it the world's worst video card). There is random garbage at the start of each scanline due to this. Still, even a corrected version of his design would need the ability to pause the CPU in random cycles, including VIA accesses. To fix it, as you mention, some orchestration with BE will be needed, as well as advance notice.

On the BE side, I plan on designing a DRQ/DACK system for my external bus, that will control BE and RDY. DRQ requests access to the bus, and DACK is asserted when the bus is free.

fachat · Post by **fachat** » Thu Jul 27, 2023 8:26 pm

Now we've discussed a lot. But what problem are you actually trying to solve? For me it helps to explicitly list the requirements and design to purpose. Otherwise I tend to over-engineer if I don't watch out...

So what I read is (sorry if I missed something or over-interpret):

1. single-step the CPU - I guess from an external trigger (pushbutton or something else)? Single-stepping triggered by? Released by? (After all, using an NMI and a timer, you can single-step the CPU just with software.) I understand this is for debugging hardware and/or software, is it? It will be difficult to debug software as you need a specific hardware trigger. It is probably easier to use a logic analyzer circuit to trigger on specific hardware conditions (address, ...) to move into single-stepping mode. What is your intent?

2. hold the CPU for an undetermined amount of time, triggered by what? Similar question as 1. Released by what? pushbutton? What would that be for?

3. wait for slow peripherals - this again needs a hardware trigger, but can be derived from the select signal for such a slow device. However as the select signal keeps being active when the CPU is stopped (either with RDY or Phi2 stretched while high), a counter (even a single-bit flip-flop is a counter) is required to disable the hold. As each device might need different delays, this is likely to be device-specific.
Are your devices capable of starting an access cycle in any of the CPU's cycles, or do they have their own (slower) clock, see below?

Anything I missed?

The only place I actually had to use RDY in my "classic" design was when I had a second CPU on the same bus. The second ("auxiliary") CPU was triggered by specific bus conditions (e.g. unmapped MMU page), and stopped the main CPU using RDY (while at the same time disabling RDY on the auxiliary CPU, so that it could start) http://www.6502.org/users/andre/csa/auxcpu/index.html It could also be used to single-step/trace the main CPU, as one other bus condition was executing an opcode (SYNC) on a page where the MMU had a "no-execute" bit set. A screenshot from such a single step can be seen on my page.
This is actually a complex design. The timing diagram here http://www.6502.org/users/andre/csa/auxcpu/timing1.jpg shows that I used a "2Phi2" clock to detect any relevant bus condition (either a switch from CPU to AUXCPU or back during phi2 high). Both CPUs are never on the bus at the same time, as you can see on the non-overlapping RDYOUT and AUXRDYOUT signals in the timing diagram (RDY is always released during phi2 low to give the main CPU time to react to it).

In my newer designs (with 65816 mostly) I use clock stretching when phi2 is high. It works by ORing the CPU's phi2 (and not necessarily I/O phi2, that can actually continue to run, depending on the device) with a stretch signal that is similar to RDY as generated above, but is not only set but also released during phi2 high.
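To make the OR-based stretching concrete, here is a small sampled-waveform sketch in Python. The sample lists are invented for illustration (two samples per half-cycle), and the function names are hypothetical; only the principle, CPU clock = phi2 OR stretch with the stretch signal both set and released while phi2 is high, comes from the description above.

```python
def stretch_clock(phi2, stretch):
    """CPU clock is the logical OR of phi2 and the stretch signal."""
    return [p | s for p, s in zip(phi2, stretch)]


def falling_edges(wave):
    """Count 1 -> 0 transitions in a sampled waveform."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a == 1 and b == 0)


# Free-running phi2, two samples per half-cycle, four cycles in total.
phi2 = [0, 0, 1, 1] * 4

# Stretch is set at sample 3 (phi2 high) and released at sample 7
# (phi2 high again), so it changes state only while phi2 is high.
stretch = [1 if 3 <= i <= 6 else 0 for i in range(len(phi2))]

cpu_phi2 = stretch_clock(phi2, stretch)

print("phi2    ", phi2)
print("stretch ", stretch)
print("cpu_phi2", cpu_phi2)
print("falling edges:", falling_edges(phi2), "->", falling_edges(cpu_phi2))
```

The I/O side still sees every falling edge of phi2, while the CPU sees one fewer: its high phase simply lasts longer, with no extra edges, because the stretch signal never changes while phi2 is low.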

Both those approaches work for devices that can start at any CPU clock cycle even if the access itself is slow. If you have a device with an own slow clock in the same clock domain, e.g. a 10MHz CPU, and a 1 MHz device where clocks are derived from the same source, you need to be able to detect the access at any of the 10 CPU cycles in the 1 MHz device cycle, and delay appropriately.

If you have CPU and devices in different clock domains (like a 10 MHz CPU clock derived from a 50MHz oscillator on an extension board, and 1 MHz I/O clock derived from a 16 MHz main board oscillator), it gets even more complicated and perhaps too much for that post.

So, long story short, what use cases do you have in mind? Can you list them and categorize them in the different situations I described? That should very much inform your circuitry.

akohlbecker · Post by **akohlbecker** » Thu Jul 27, 2023 9:24 pm

Hi Andre, appreciate the feedback and links. Let me try to specify the problem I'm trying to solve better

1. Control computer cycles from an external, independently clocked, system. Allow enough time for that system to read the state of the busses and control signals and process them, before going to the next cycle. My specific use case is an Arduino sending tracing info to a host computer, acting as a logic analyzer, with more pins but slower. Trigger will be a trio of '688 comparators on the address bus, for a hardware breakpoint set by the Arduino.
2. During an active tracing session, the computer needs to be slowed to a speed manageable by the Arduino and its serial transfers, which might be a few hundred kHz. This should be automatic, I don't want to have to tune the tracing speed and risk losing cycles in the trace if it is not exact. So the Arduino needs to control the speed of the computer. It could take over the clock, this is how I'm currently doing it. But there is the risk of glitches during the takeover, and having a single clock source also makes designing the computer easier. Which is why I'm looking more into RDY and controlling the computer through this pin.
3. Tracing the execution should not impact running programs besides slowing down. VIA timers should keep running at the speed they've been intended to run at.
4. It should be possible to gate interrupts, so a given trace runs uninterrupted. For example, if I'm single stepping cycles manually through my Arduino, I don't want every instruction cycle to finish by going into the interrupt routines because timers are firing. On the other hand, it should also be possible to transparently run interrupts between some steps, by setting the hardware breakpoint to the next instruction, and resuming regular execution. I don't think executing interrupts is possible WHILE waiting at a given cycle, so that's the next best thing.
5. The user interface for this system is a terminal with the following commands: `b 007FFF` to set a breakpoint. `s` to single step a clock cycle. `r` to run at tracing speed. `q` to resume normal execution, `i` to toggle interrupt gating, `n` to go to the next instruction with interrupts in-between. Each clock cycle is printed to the console with the value of the address, the value of the data, and all status signals (RWB, VDA, VPA, E, M, X, ML, VP).
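The command set in point 5 can be sketched as a small parser in Python. This is only a sketch under assumptions: the action names, the helper `comparator_bytes`, and the idea that each of the three '688 comparators takes one byte of a 24-bit address are illustrative guesses, not the existing Teensy code.

```python
# Single-letter commands that take no argument (action names are made up).
SIMPLE_COMMANDS = {
    "s": "step_cycle",
    "r": "run_trace",
    "q": "resume",
    "i": "toggle_interrupt_gate",
    "n": "next_instruction",
}


def parse_command(line):
    """Parse one terminal line into an (action, argument) tuple."""
    parts = line.split()
    if not parts:
        raise ValueError("empty command")
    cmd, args = parts[0], parts[1:]
    if cmd == "b":
        if len(args) != 1:
            raise ValueError("usage: b <hex address>")
        addr = int(args[0], 16)
        if not 0 <= addr <= 0xFFFFFF:
            raise ValueError("address must fit in 24 bits")
        return ("set_breakpoint", addr)
    if cmd in SIMPLE_COMMANDS and not args:
        return (SIMPLE_COMMANDS[cmd], None)
    raise ValueError("unknown command: " + line)


def comparator_bytes(addr):
    """Split a 24-bit address into (bank, high, low) bytes.

    Assumption: one byte per 8-bit comparator, three comparators total.
    """
    return ((addr >> 16) & 0xFF, (addr >> 8) & 0xFF, addr & 0xFF)


action, addr = parse_command("b 007FFF")
print(action, [format(b, "02X") for b in comparator_bytes(addr)])
```

Keeping the parser separate from whatever drives the hardware makes it easy to test the terminal interface without the board attached.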

Note: I'm talking about Arduino for ease of understanding, but this will be a set of MCP23S17 sending bytes to a host computer over USB-to-SPI, no microcontroller involved. At the moment, I have the beginning of these ideas implemented in a Teensy, but there is a lot of work left to make it nice. Still, it has been immensely useful to develop software for the computer. Being able to set breakpoints and single step from there is great, even though it is currently very clunky with interrupts.

So, this is the single stepping circuit goal and what I'm focusing on at the moment. There are other blocks in my computer that may want to control execution (wait states, external peripherals doing DMA), but those are not really fleshed out yet, so I want to keep my options open by still supporting a generic use of RDY.

fachat · Post by **fachat** » Thu Jul 27, 2023 10:25 pm

Ok got some better idea.

Maybe it makes sense to look at a 2phi2 signal that samples the address comparators and the input from the Arduino shortly after phi2 transitions and decides to set RDY or release it. At least IIRC that fulfills the CPU timing conditions for CMOS. The Arduino then just needs to check if RDY is actually low before reading/taking over the bus.

Anything else I believe can then be controlled by the Arduino.

André

fachat · Post by **fachat** » Thu Jul 27, 2023 10:26 pm

I.e. CMOS 6502s.

NMOS 6502 don't stop writes on RDY

akohlbecker · Post by **akohlbecker** » Thu Aug 03, 2023 9:02 pm

fachat wrote:

Ok got some better idea.

Maybe it makes sense to look at a 2phi2 signal that samples the address comparators and the input from the Arduino shortly after phi2 transitions and decides to set RDY or release it. At least IIRC that fulfills the CPU timing conditions for CMOS. The Arduino then just needs to check if RDY is actually low before reading/taking over the bus.

Anything else I believe can then be controlled by the Arduino.

André

Yeah, that would probably work! I need to flesh out this circuit more and see where it takes me.

akohlbecker · Post by **akohlbecker** » Thu Aug 03, 2023 9:17 pm

I've been inspired by George's ideas and decided to toy with another way to stretch the clock. The idea would be to have a similar input to RDY, sampled around the falling edge of the clock, but which stretches the clock fed to the CPU instead of going to its RDY input. This circumvents the bi-directional nature of RDY and the difficulty of having multiple consumers agree on its state.

So, I came up with the following

On the left, there is my habitual delay line arrangement, generating 3 additional phases of the clock (CLK_SRC is -20±5ns, CLK- is -10±2.5ns and CLK+ is +10±2.5ns, using a DS1035M-10).

On the right, PHI2 is generated from a combination of the input CLK and RDY_IN sampled on the falling edge of CLK_SRC (so 20ns early). The sampling needs to be advanced to give time to the flip flop to propagate. Unfortunately, because the '112 is the only available logic IC I know capable of triggering on the falling edge of the clock, RDY_IN also needs to be inverted, which adds a bit of delay.
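For a sense of scale, the tap offsets quoted above can be compared with the clock period. The Python sketch below only does that arithmetic, using the 14 MHz figure from this post; it deliberately leaves out the propagation delays of the flip-flop and inverter, which are not given here, and the helper names are made up.

```python
CLOCK_HZ = 14_000_000
period_ns = 1e9 / CLOCK_HZ          # about 71.4 ns
half_period_ns = period_ns / 2      # about 35.7 ns

# (nominal offset, tolerance) in ns relative to CLK, as quoted in the post.
taps = {
    "CLK_SRC": (-20.0, 5.0),
    "CLK-": (-10.0, 2.5),
    "CLK+": (10.0, 2.5),
}


def window(nominal, tol):
    """Earliest and latest offset of a tap, in ns."""
    return (nominal - tol, nominal + tol)


def fraction_of_half_period(offset_ns):
    """How much of a half clock period a given offset takes up."""
    return abs(offset_ns) / half_period_ns


for name, (nominal, tol) in taps.items():
    lo, hi = window(nominal, tol)
    print(f"{name:8s} {lo:+6.1f} .. {hi:+6.1f} ns "
          f"({fraction_of_half_period(nominal):.0%} of a half period)")
```

At this clock speed the 20 ns advance of CLK_SRC is already more than half of a clock phase, which gives a feel for how little margin is left for added gate delays.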

All in all, this gives the following timings at 14MHz:

As you can see, all this logic adds quite a bit of delay. If I shift the diagram to the point of view of PHI2, which is what the CPU and the user will see, we get the following

So, my goal of sampling this signal around the falling edge of the clock is clearly not reached

Still, maybe this will inspire other ideas!
