Many pages ago, I described the SCSI host adapter I built to attach mass storage to my POC units. First a little history...
The core of the original host adapter was the 53C94 “advanced SCSI controller,” an NCR device that was second-sourced by AMD and others, and was used in the host adapters of a number of different systems. The C94 is a member of a family of NCR SASI/SCSI controllers that dates back to the early 1980s, well before SCSI had been defined as a formal standard. The C94 has considerable intelligence, offloading virtually all aspects of the bus protocol from the host system. That intelligence makes it possible to support SCSI without the driver having to know much about the bus protocol.
A project of mine from the early 1990s used the C94’s ancestor, the 53C90A, to implement SCSI on a unit powered by a 65C02 running at 8 MHz. The C90A supported the then-nascent ANSI SCSI-2 standard (1989), as well as the older ANSI SCSI-1 standard (1986), and the still-older SASI bus developed by Shugart Associates in 1979 and first commercially supported by NCR in 1981. Using the C90A and some reasonably efficient programming, I was able to achieve a respectable transfer rate of around 350 KB/second with this setup, about 10 percent of the maximum raw, asynchronous SCSI bus speed supported by the C90A.
When I decided to add SCSI to POC V1, I initially tried to source the 53C90A to build the host adapter. An NOS liquidator that I had contacted advised me that they didn’t have the C90A, but did have the NCR 53C94, which can be described as a 53C90A on steroids. After studying the C94’s data sheet, I decided to design my host adapter around it.
The C94 is able to support programmed I/O (PIO) or direct memory access (DMA) transfers between the SCSI bus and the host system. In studying the C94’s data sheet, it soon became patently clear that use of PIO was to be avoided, which early experimentation subsequently proved. When operated in PIO mode, the C94 will generate an interrupt for each byte processed during any information transfer bus phase. Case in point: a typical hard disk block is 512 bytes. Hence 512 interrupts will be generated to read or write one block, plus other interrupts will be generated as the SCSI bus changes phases and the host adapter and target device exchange status information.
Given the 15 clock cycles of overhead associated with the MPU responding to an interrupt, performance will suck: 7,680 clock cycles will be consumed in interrupt overhead alone to transfer one block—and then there will be the 20-or-so clock cycles needed to actually fetch and store each byte. I won’t even mention the preamble and postamble parts of the interrupt service routine that will be executed 512 times per block read or written.
On the other hand, when using DMA with the C94, the device will interrupt only when all bytes have been transferred or a bus phase change has occurred, reducing a typical SCSI transaction to no more than four or five interrupts. Almost all processing time will be expended in actual data transfer, which can be done quite efficiently with the right code.
The problem is in implementing DMA with the C94. I’m not aware of the existence of a DMA controller that is adaptable to a 65C816 system running in the 16+ MHz range. It might be possible to make a CPLD act as a DMA controller, but my CPLD programming skills aren’t at that level.
I had considered the possibility of rigging up a 65C02 as a sort-of DMA controller, taking advantage of the SOB input to quickly recognize when the C94 is ready to accept or provide some data. I also conjured another method using IRQB rigged up in a way that would trigger the byte transfer without sustaining the overhead of processing an interrupt (the SEI - WAI trick). I even thought of using an 816 as the DMA “controller,” along with some clock trickery for pacing transfers. After pondering how two MPUs would co-exist on the same bus (not simple) and doing myriad cycle counts to estimate performance, I concluded it would be a lot of work for, at best, a 2-to-1 speedup.
That said, I was bound and determined to figure out how to avoid doing PIO, which led me to conclude I could arrange the host adapter so the 65C816 could do “pretend DMA” in software. Performance would be nowhere near what could be achieved with a real DMA controller, but would be a lot faster than managing the IRQ storm that PIO operation would produce.
The C94 has 16 registers in total, which are accessed through conventional decoding in the POC unit’s I/O block. Like most I/O chips, the C94 has a chip select (/CS) input, which, along with its A0-A3 inputs and the states of the /RD and /WR inputs, exposes a register to the data bus. Most registers are used to configure the device and report status. A single register, the “FIFO,” acts as an intermediary between the host system and the SCSI bus. Succinctly, during an information transfer bus phase, incoming data are read from the FIFO and outgoing data are written to the FIFO. In general, the host system doesn’t need to be concerned with how the C94 communicates with the SCSI bus, as bus access handshaking is completely automatic.
The interesting thing about the FIFO is that it is accessible in one of two ways. If /CS is asserted and the bit pattern %0010 is placed on A3-A0, the FIFO can be read or written, same as the other registers in the C94. This is the method used when PIO is employed to read from or write to a SCSI device. As earlier noted, each such access will cause the C94 to interrupt.
The C94 also has a DMA port that directly communicates with the FIFO. A separate C94 select input, /DACK, will expose the FIFO to the data bus—/CS is not used at all and in fact, must never be asserted when /DACK is asserted (experimentation with doing so caused major chaos). /DACK is used in conjunction with a C94 output called DREQ to form a handshaking setup that normally would be wired to a DMA controller.
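To make the distinction concrete, here is a sketch of the two paths to the same FIFO—C94_FIFO and C94_DMA are placeholder labels for the decoded addresses described further on:
Code:
         lda c94_fifo       ;FIFO via /CS with A3-A0 = %0010 (PIO; IRQ per byte)
         lda c94_dma        ;FIFO via /DACK (DMA port; no IRQ per byte)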
During a DMA read operation, the C94 will assert DREQ when at least one byte is available in the FIFO. The DMA controller would respond by asserting /DACK to expose the FIFO to the data bus, fetching the waiting byte, releasing /DACK, and storing the byte in RAM. If more data is available in the FIFO, the C94 will again assert DREQ and the sequence can be repeated. As previously noted, the C94 takes care of synchronizing SCSI bus activity with data outflow from the FIFO—the DMA controller only has to fetch from the FIFO when DREQ is asserted.
During a DMA write operation, the C94 will assert DREQ if it can accept a byte—I should note the FIFO is 16 bytes deep. The DMA controller would respond by fetching a byte from RAM, asserting /DACK to expose the FIFO, writing the byte into the FIFO and then releasing /DACK. If the C94 is able to accept more data, it will again assert DREQ. As with a read operation, SCSI bus synchronization is handled by the C94, and the DMA controller need not be concerned with it.
As part of the DMA setup, a register pair in the C94, the DMA transfer counter, must be loaded with a count of the bytes that are to be transferred, usually 512 per hard disk block. This counter is decremented each time /DACK is toggled. When the count reaches zero, the C94 will generate an interrupt. In a system with an actual DMA controller, that interrupt will be the signal to the microprocessor that the transfer has completed. In my “pretend DMA” setup, that interrupt is used in the driver software to route foreground execution, which design feature inadvertently created an obscure timing issue about which I will shortly bloviate.
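As an illustration, arming the counter for one block might look like the following, a minimal sketch that assumes the 53C9x-family register layout (transfer count LSB and MSB at register offsets 0 and 1, command register at offset 3) and a DMA-modified “transfer information” command; C94_REG stands for the register-file base address given below:
Code:
;arm the DMA transfer counter for a 512-byte block (sketch only)
         sep #%00100000     ;8-bit accumulator
         lda #$00           ;512 = $0200: count LSB...
         sta c94_reg        ;...to transfer counter low (register 0, assumed)
         lda #$02           ;count MSB...
         sta c94_reg+1      ;...to transfer counter high (register 1, assumed)
         lda #%10010000     ;"transfer information" command w/DMA bit (assumed)
         sta c94_reg+3      ;command register (register 3, assumed)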
In my initial work to adapt SCSI to POC V1.0, I started by using PIO to verify that my basic chip-select logic and I/O routines were functional. During that phase of experimentation, it became painfully apparent how slow PIO operation is. With that out of the way, the next step was to get “pretend DMA” working. This entailed assigning three different decoded I/O pages to the host adapter, one for general register access, a second page to directly access the DMA port by asserting /DACK, and a third one to fetch status information, which initially was the state of DREQ.
In POC V1.3, the current iteration of the POC series, the C94’s /CS will be asserted if there is an access to the $00C4xx I/O page, which gives register access according to the bit pattern on A0-A3. An access to any address in the $00C5xx I/O page will assert /DACK, hence exposing the FIFO to the data bus. With either access method, the operation to be performed is determined by the states of /RD and /WR, both signals being qualified by Ø2 in usual 6502 fashion.
Generating host adapter status is a little more involved. Part of the host adapter circuitry is a 74ACT541 bus driver, which is wired so a read of any address in the $00C6xx I/O page will connect the ACT541 to the data bus. Bit 7 reflects the state of DREQ and, in the second-generation host adapter design (which this new design replaces), bits 2-0 indicate the host adapter’s jumper-settable SCSI bus ID (the bus ID is read during POST and used to configure the C94’s notion of its bus ID). Bits 6-3 are unused and always return 0. By having DREQ appear as bit 7, it is easily tested with the BIT instruction.
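Expressed as equates, the host adapter’s footprint in the I/O block looks like this (a sketch; the labels match the code below):
Code:
;SCSI host adapter I/O assignments (POC V1.3)
c94_reg  =$00c400           ;register file: /CS asserted, A3-A0 select
c94_fifo =c94_reg+2         ;FIFO as register %0010 (PIO access)
c94_dma  =$00c500           ;DMA port: any access asserts /DACK
c94_stat =$00c600           ;status: bit 7 = DREQ, bits 2-0 = bus ID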
With this arrangement, the code to produce a “pretend DMA” transfer is quite simple, once preliminary operations have configured the C94 as required and the target device (disk, tape, etc.) has been selected and has responded. For example, writing to the SCSI bus is as follows (not the exact code, but functionally identical):
Code:
dmawrit  sep #%00100000     ;8-bit accumulator
         rep #%00010000     ;16-bit index
         ldy !#0            ;storage index
;
.loop    bit c94_stat       ;can data be accepted?
         bpl .loop          ;no
;
         lda [ptr01],y      ;fetch from RAM &...
         sta c94_dma        ;store to C94 DMA port
         iny
         bra .loop          ;next
Using BIT on C94_STAT reports the status of DREQ, which will be low if the C94 is not ready to accept a byte, thus implementing the required handshaking. As soon as DREQ goes true, a byte can be written to the DMA port. An access to C94_DMA will assert /DACK, which as earlier explained, will expose the FIFO to the data bus. Writing to the FIFO will decrement the DMA transfer counter.
Although the above appears to be an infinite loop, it will be broken shortly after a write to C94_DMA if the write causes the target device to change the bus phase, or if the DMA transfer count reaches zero.¹ When the count hits zero, the C94 will interrupt and report a “DMA transfer completed” status. In response, the SCSI portion of the interrupt service routine (ISR) will modify the address pushed to the stack by the 65C816 when it took the interrupt (using stack-relative addressing), which will result in the foreground part of the SCSI driver being redirected away from the transfer loop.
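The redirection itself takes only a few instructions. Below is a minimal sketch, assuming nothing else has been pushed to the stack since the interrupt was taken; DMADONE is a hypothetical label for the post-transfer code, which must live in the same program bank, since only the pushed program counter is rewritten:
Code:
;in the ISR, upon seeing "DMA transfer completed" status...
;stack frame at this point: 1,S = P, 2,S = PCL, 3,S = PCH, 4,S = PB
         rep #%00100000     ;16-bit accumulator
         lda !#dmadone      ;address of post-transfer code (hypothetical)
         sta 2,s            ;overwrite pushed return address
;RTI will now "return" to DMADONE instead of the transfer loop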
Reading from the SCSI bus follows a similar pattern:
Code:
dmaread  sep #%00100000     ;8-bit accumulator
         rep #%00010000     ;16-bit index
         ldy !#0            ;storage index
;
.loop    bit c94_stat       ;can data be gotten?
         bpl .loop          ;no, wait
;
         lda c94_dma        ;fetch from C94 DMA port &...
         sta [ptr01],y      ;store to RAM
         iny
         bra .loop          ;next
In this case, the DMA transfer counter will decrement with each fetch from C94_DMA and, as with the previous write loop, a “DMA transfer completed” interrupt will occur when the transfer count reaches zero. Again, said interrupt will result in the foreground being redirected to another part of the driver, breaking the loop.
The described arrangement worked quite well with POC V1.1 running at 12.5 MHz and could consistently maintain a transfer rate of around 650 KB/second with multi-block transfers in either direction. Shortly after I had gotten this refined, I snagged a supply of NCR 53CF94 “fast SCSI controllers,” the CF94 being a souped-up version of the 53C94 and able to support asynchronous bus speeds of 5 MB/second, versus 3.5 MB/second maximum with the C94.
Multi-block transfer speeds increased to around 710 KB/second once I had modified the SCSI driver to take advantage of the CF94. During experimentation with the driver, I noted that the CF94 almost never deasserted DREQ once a transfer commenced, which likely accounted for the improved throughput.
However, using the CF94 also introduced some “operational anomalies” that occasionally messed up a read transfer...but never a write. I couldn’t see any obvious reason for it nor could I determine the precise nature of the error, but figured it was likely a system bus timing issue, or perhaps a consequence of not having I/O wait-stating. With no logic analyzer handy at the time and realizing that debugging with a logic probe and scope was not likely to shed any light on the problem, I returned to using the C94 and set the problem aside for another day.
Meanwhile, POC V1.3 had come off the drawing board and was stable at 16 MHz. I decided to design a host adapter to go with V1.3 (different mechanical layout than POC V1.1) and mostly copied the existing circuit. While I was at it, I added the jumper-selectable SCSI bus ID function to the new host adapter, rather than hard-coding the bus ID into the firmware. Since there was some room left over on the PCB, I incorporated a “SCSI bus active” indicator, basically a red LED driven by an inverter connected to one of the bus control lines—the LED would be lit any time the bus was in use. The new host adapter worked on the first try, and was runnable with an edited copy of V1.1’s SCSI driver (I subsequently rewrote the driver, as the original had more patches than a hobo’s trousers, and was a bit of a mess).
With the new hardware running in a stable fashion, I decided to put the CF94 back into use...only to again be confronted by “operational anomalies.” By now, I had a 32-channel logic analyzer at my disposal...it was interesting watching the SCSI bus gyrations... Anyhow, what was happening was every so often a read transfer (the above DMAREAD function) would drop the last byte. There was indeed a timing booby-trap lurking within, which had been present all along, but would not cause trouble with the C94, even when running on the slower POC V1.1. It took the higher performance of POC V1.3 and the CF94 to expose it.
Recall that when the DMA transfer count reaches zero, the C(F)94 will interrupt. In a read transaction, there is a narrow window of opportunity for the interrupt to “sneak in” just before the byte that was fetched gets stored in RAM. If that interrupt does sneak in, the final write will not occur, resulting in a dropped byte. Sneaky IRQs weren’t happening with the C94 because the time span from when the DMA counter reached zero to when the device interrupted was longer than the time required for the 65C816 to fetch the STA [PTR01],Y instruction’s opcode; once an instruction’s opcode has been fetched, the 65C816 will complete that instruction before responding to an IRQ. So the last byte would always get written before the IRQ broke the transfer loop.
As it turned out, the CF94 reacts more quickly than the C94 to events and thus can sneak in the interrupt before the 65C816 can fetch the STA [PTR01],Y instruction’s opcode. In fact, by increasing the CF94’s clock from 20 MHz to 25 MHz (the CF94 can support a 40 MHz clock), I could make the error consistently occur and, using the logic analyzer, could see that the CF94 asserted its /INT output mere nanoseconds after /DACK was deasserted following a fetch from the FIFO. The lag between /DACK being deasserted and /INT being asserted was small enough to cause the IRQ to hit very early in the Ø2 low phase, within the timing window where the 65C816 samples IRQB.
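Mapped onto the read loop, the race looks like this:
Code:
.loop    bit c94_stat       ;DREQ set: last byte is waiting in the FIFO
         bpl .loop
         lda c94_dma        ;/DACK toggles, count hits zero, /INT asserts
;   <————— CF94: IRQB sampled true here, so STA never executes
         sta [ptr01],y      ;C94: opcode fetch wins the race & the byte is stored
         iny
         bra .loop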
In an effort to deal with this problem, I modified the read function as follows:
Code:
dmaread  sep #%00100000     ;8-bit accumulator
         rep #%00010000     ;16-bit index
         ldy !#0            ;storage index
;
.loop    bit c94_stat       ;can data be gotten?
         bpl .loop          ;no, wait
;
         sei                ;<————— hold off the IRQ...
         lda c94_dma        ;fetch from C94 DMA port &...
         sta [ptr01],y      ;store to RAM
         cli                ;<————— ...until the byte is safely stored
         iny
         bra .loop          ;next
While the above modification addressed the timing problem, it did so at the expense of performance, as each loop iteration uses four additional Ø2 cycles (two each for SEI and CLI). Four cycles doesn’t sound like much, until you consider that in a one-block disk read operation, that amounts to 2,048 extra cycles. Adding to the fun, the POC V1.3 SCSI driver is able to read or write up to 64KB in a single transaction. A 64KB read will consume more than 260,000 clock cycles executing all those SEIs and CLIs in the loop. Ouch!
Although SCSI performance as it sits right now is pretty good, all that wasted processing time relentlessly bugs me, compelling me to find a way to reclaim those clock cycles. An approach would be to eliminate SEI and CLI from the loop and somehow postpone a CF94 “DMA transfer complete” interrupt long enough to guarantee that the final byte will be safely stored before the read loop is broken. As my thoughts evolved on this, it became clear that there should be some controlled propagation delay in the CF94’s /INT circuit, but only during a high-to-low transition. The low-to-high transition, on the other hand, should be as expeditious as possible to avoid the risk of a spurious interrupt.
For some reason, I had a hard time getting my head around the circuit details—a “senior moment,” as it were—and thus sought some suggestions, which primed the pump for me and led to a solution.
Attachment: IRQ Delay Circuit (delayed_irq.jpg)
The above circuit delays the propagation of the CF94’s /INT signal by about 5-or-so microseconds (assuming my math was correct in selecting the values for C1 and R2), which gives the 65C816 plenty of time to write the final byte to RAM during a read transaction, even at 1 MHz. The circuit will rapidly clear once /INT goes high, which should avoid a dreaded spurious IRQ.
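For the record, the arithmetic is the usual RC relationship: with C1 charging through R2 toward a CMOS input threshold at roughly half the supply, the delay is approximately 0.7 × R2 × C1. Purely as an illustration (not necessarily the values on the schematic), R2 = 15K and C1 = 470 pF works out to about 4.9 microseconds.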
I have designed a revised host adapter circuit and PCB layout to implement this change. The PCBs are on the way. More on this when I have built the host adapter.
Attachment: New-Design SCSI Host Adapter Schematic (scsi_hba_schematic.pdf)
Attachment: New-Design SCSI Host Adapter PCB (hba_delayed_irq.gif)
————————————————————
¹Under normal circumstances, a change-in-bus-phase IRQ will be coincident with a DMA-counter-reaching-zero IRQ. The bus-phase-change IRQ has priority, as it could occur if the target device experiences an error that prevents completion of the transaction.