POC Computer Version One

Rob Finch · Post by **Rob Finch** » Mon Nov 06, 2023 2:37 pm

kernelthread wrote:

Dr Jefyll wrote:

The complication is that Phi2 is usually what's used to qualify the /WE pulse that's sent to RAM and to non-65xx peripherals. And that's fine for devices that don't need extra time, because you want the rise of Phi2 to cause /WE to go low without delay. But in the case of a device that does need extra time, there could be trouble ("rogue" writes to incorrect addresses) if Phi2 causes /WE to go low without delay.

This seems to imply that by the time Phi2 rises we need to know whether or not extra time will be required. And, in an overclocked system, knowing this by the time Phi2 rises is a problem.

Okay, so this morning I'm doing a head slap.

The problem goes away if devices that need extra time don't use the same /WE pulse that's sent to fast devices. A little extra logic can give slow devices a /WE pulse of their own -- it would be a fairly minor addition to the clock-stretch circuit. Forgive me if that's what you already had in mind, kernelthread.

-- Jeff

To be honest, I hadn't realised this might be a problem. I was just considering a system consisting of RAM, ROM and VIA(s). RAM WE can be qualified with PHI2 as usual, as can WE for Flash memory if that is being used for ROM. VIAs are a problem because the chip select has to be asserted before the VIA input clock rises. But then you also don't really want to use PHI2 as the VIA input clock because the timers will be affected by clock stretching - you ideally want a constant clock input to the VIA so you can get predictable timing intervals.

I've come up with some Verilog code (not tested) to do what I was thinking of:



input FCLK;			// fast clock, 2 x nominal CPU frequency
input nRST;			// system reset signal

reg [2:0] STATE;	// Clock stretch state (see below)
reg PHI2;			// clock to CPU
reg VIA_CLK;		// clock for VIAs and similar peripherals (half nominal CPU frequency, constant frequency)
reg VIA_SEL;		// chip select for VIAs
                    // asserted when VIA_CLK is low when VIA access is needed
                    // deasserted at following VIA_CLK falling edge
reg DELAYED_SEL;    // delayed chip select for any other purpose
                    // asserted at beginning of first stretch cycle
                    // deasserted at end of bus cycle
                    // maybe useful if any device requires more settling time before RD/WR strobes activated

/* STATE values:

000     First half cycle of every bus cycle, PHI2 = 0
001     Second half cycle of every bus cycle, PHI2 = 1. Last half cycle if no stretching needed.
010     6 input clocks (3 nominal CPU cycles) remaining, PHI2 = 1
011     5 input clocks (2.5 nominal CPU cycles) remaining, PHI2 = 1
100     4 input clocks (2 nominal CPU cycles) remaining, PHI2 = 1
101     3 input clocks (1.5 nominal CPU cycles) remaining, PHI2 = 1
110     2 input clocks (1 nominal CPU cycle) remaining, PHI2 = 1
111     Last half cycle following stretch, PHI2 = 1

*/

// These logic terms should be capable of evaluation within one clock period of FCLK
// It is assumed they are functions of the active address, so remain asserted for the entire bus cycle
wire STRETCH1 = ... ;	// insert logic to determine if device needs stretch by 1 cycle
wire STRETCH2 = ... ;	// insert logic to determine if device needs stretch by 2 cycles
wire STRETCH3 = ... ;	// insert logic to determine if device needs stretch by 3 cycles
wire IS_VIA = ... ;		// insert logic to determine if it's a VIA

always @(posedge FCLK or negedge nRST) begin
	if (!nRST) begin
		STATE <= 3'b000;
		PHI2 <= 1'b0;
		VIA_CLK <= 1'b0;
		VIA_SEL <= 1'b0;
		DELAYED_SEL <= 1'b0;
	end
	else if (STATE == 3'b000) begin
        // CPU clock is low in state 000, high in any other state
		STATE <= 3'b001;
		PHI2 <= 1'b1;
	end
    else if (STATE == 3'b001) begin
        // Second state of bus cycle - need to determine required wait states
        // by the end of this state.
        if (STRETCH1) begin
            // state increments twice before PHI2 goes low again
            STATE <= 3'b110;

            // delayed select asserted at beginning of stretch
		    DELAYED_SEL <= 1'b1;
        end
        else if (STRETCH2) begin
            // state increments 4 times before PHI2 goes low again
            STATE <= 3'b100;

            // delayed select asserted at beginning of stretch
		    DELAYED_SEL <= 1'b1;
        end
        else if (STRETCH3) begin
            // state increments 6 times before PHI2 goes low again
            STATE <= 3'b010;

            // delayed select asserted at beginning of stretch
		    DELAYED_SEL <= 1'b1;
        end
        else if (IS_VIA) begin
            // VIA_CLK will toggle at the next input clock, however it is not
            // known whether it will go low or high
            // If it will go low, assert VIA_SEL at the same time and do a 2 cycle stretch
            // If it will go high, don't assert VIA_SEL yet, but start a 3 cycle stretch
            // VIA_SEL will be asserted after 1 cycle of stretch
            if (VIA_CLK) begin
                // VIA_CLK will go low at next input clock, so assert VIA_SEL there as well
                VIA_SEL <= 1'b1;

                // need 2 cycle stretch
                STATE <= 3'b100;
            end
            else begin
                // need 3 cycle stretch, don't assert VIA_SEL yet
                STATE <= 3'b010;
            end
        end
        else begin
            // no stretching needed - back to PHI2=0
            STATE <= 3'b000;
		    PHI2 <= 1'b0;
        end
    end
    else if (STATE == 3'b111) begin
        // Last state of bus cycle following stretching
        // Back to PHI2=0 on next input clock
        STATE <= 3'b000;
		PHI2 <= 1'b0;

        // VIA select deasserted here
        VIA_SEL <= 1'b0;

        // delayed select deasserted here
		DELAYED_SEL <= 1'b0;
    end
    else begin
        // stretching operation underway but not yet at the end
        // states 010, 011, 100, 101, 110
        if ((STATE == 3'b011) & IS_VIA) begin
            // VIA_CLK will go low at next input clock, so assert VIA_SEL there as well
            VIA_SEL <= 1'b1;
        end

        STATE <= STATE + 3'b001;
    end

    if (STATE[0]) begin
        // VIA_CLK toggles whenever unstretched clock would fall
        // i.e. when state number transitions from odd to even
        VIA_CLK <= ~VIA_CLK;
    end
end

I added a DELAYED_SEL signal to accommodate any devices which need additional settling time for the address and/or data before being activated, following your suggestion.

First, I am not sure if this makes a difference to a CPLD, but in an FPGA it may.

This is a bit of nit-picking on Verilog code, but using case statements instead of if/else if logic may generate faster logic that uses fewer resources.
If/else if chains infer a priority encoder, which may end up as cascaded logic, while case statements infer decoders, which often have support built into the PLD.

Altered (untested) code shown below.


always @(posedge FCLK or negedge nRST) begin
	if (!nRST) begin
		STATE <= 3'b000;
		PHI2 <= 1'b0;
		VIA_CLK <= 1'b0;
		VIA_SEL <= 1'b0;
		DELAYED_SEL <= 1'b0;
	end
	else begin
		// We always want the next state unless overridden.
    STATE <= STATE + 3'b001;

   	case(STATE)
   	3'b000:
   		begin
        // CPU clock is low in state 000, high in any other state
        //STATE <= 3'b001;
  	    PHI2 <= 1'b1;
   		end
   	3'b001:
    	begin
    		casez({STRETCH1,STRETCH2,STRETCH3,IS_VIA})
        // Second state of bus cycle - need to determine required wait states
        // by the end of this state.
        4'b1???:
        	begin
            // state increments twice before PHI2 goes low again
            STATE <= 3'b110;

            // delayed select asserted at beginning of stretch
          	DELAYED_SEL <= 1'b1;
        	end
        4'b01??:
        	begin
            // state increments 4 times before PHI2 goes low again
            STATE <= 3'b100;

            // delayed select asserted at beginning of stretch
          	DELAYED_SEL <= 1'b1;
        	end
        4'b001?:
        	begin
            // state increments 6 times before PHI2 goes low again
//            STATE <= 3'b010;

            // delayed select asserted at beginning of stretch
          	DELAYED_SEL <= 1'b1;
          end
        4'b0001:
            // VIA_CLK will toggle at the next input clock, however it is not
            // known whether it will go low or high
            // If it will go low, assert VIA_SEL at the same time and do a 2 cycle stretch
            // If it will go high, don't assert VIA_SEL yet, but start a 3 cycle stretch
            // VIA_SEL will be asserted after 1 cycle of stretch
            if (VIA_CLK) begin
                // VIA_CLK will go low at next input clock, so assert VIA_SEL there as well
                VIA_SEL <= 1'b1;

                // need 2 cycle stretch
                STATE <= 3'b100;
            end
            else begin
                // need 3 cycle stretch, don't assert VIA_SEL yet
//                STATE <= 3'b010;
            end
        default:
        	begin
            // no stretching needed - back to PHI2=0
            STATE <= 3'b000;
          	PHI2 <= 1'b0;
          end
        endcase
    	end
    3'b111:
    	begin
        // Last state of bus cycle following stretching
        // Back to PHI2=0 on next input clock
//        STATE <= 3'b000;
      	PHI2 <= 1'b0;

        // VIA select deasserted here
        VIA_SEL <= 1'b0;

        // delayed select deasserted here
      	DELAYED_SEL <= 1'b0;
    	end
    3'b011:
    	begin
        // stretching operation underway but not yet at the end
        // (states 010, 100, 101, 110 fall through to the default increment)
        if (IS_VIA) begin
            // VIA_CLK will go low at next input clock, so assert VIA_SEL there as well
            VIA_SEL <= 1'b1;
        end
    	end
    default:	;
    endcase

    end

    if (STATE[0]) begin
        // VIA_CLK toggles whenever unstretched clock would fall
        // i.e. when state number transitions from odd to even
        VIA_CLK <= ~VIA_CLK;
    end
end
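
Since both listings above are untested, one cheap sanity check is to model the state machine in software before committing it to a CPLD. What follows is only a minimal Python sketch, not something posted by either author: `step_if_else` mirrors the if/else version, `step_case` mirrors the intended behaviour of the case version, and `bus_cycle` is a hypothetical helper that clocks one bus cycle starting from state 000. It assumes the stretch inputs stay constant for the whole bus cycle, as the comments in the Verilog state.

```python
from itertools import product

# Registers are modelled as a tuple: (STATE, PHI2, VIA_CLK, VIA_SEL, DELAYED_SEL).
# Every "next" value is computed from the old values, like nonblocking assignments.

def step_if_else(regs, s1, s2, s3, is_via):
    """One FCLK rising edge, following the if/else listing."""
    state, phi2, via_clk, via_sel, delayed_sel = regs
    nxt = [state, phi2, via_clk, via_sel, delayed_sel]
    if state == 0b000:
        nxt[0], nxt[1] = 0b001, 1
    elif state == 0b001:
        if s1:
            nxt[0], nxt[4] = 0b110, 1
        elif s2:
            nxt[0], nxt[4] = 0b100, 1
        elif s3:
            nxt[0], nxt[4] = 0b010, 1
        elif is_via:
            if via_clk:
                nxt[0], nxt[3] = 0b100, 1
            else:
                nxt[0] = 0b010
        else:
            nxt[0], nxt[1] = 0b000, 0
    elif state == 0b111:
        nxt[0], nxt[1], nxt[3], nxt[4] = 0b000, 0, 0, 0
    else:
        if state == 0b011 and is_via:
            nxt[3] = 1
        nxt[0] = state + 1
    if state & 1:                       # VIA_CLK toggles when leaving an odd state
        nxt[2] = via_clk ^ 1
    return tuple(nxt)

def step_case(regs, s1, s2, s3, is_via):
    """One FCLK rising edge, following the case listing (default: STATE + 1)."""
    state, phi2, via_clk, via_sel, delayed_sel = regs
    nxt = [(state + 1) & 0b111, phi2, via_clk, via_sel, delayed_sel]
    if state == 0b000:
        nxt[1] = 1
    elif state == 0b001:
        if s1:                          # casez 4'b1???
            nxt[0], nxt[4] = 0b110, 1
        elif s2:                        # casez 4'b01??
            nxt[0], nxt[4] = 0b100, 1
        elif s3:                        # casez 4'b001? (STATE falls through to 010)
            nxt[4] = 1
        elif is_via:                    # casez 4'b0001
            if via_clk:
                nxt[0], nxt[3] = 0b100, 1
        else:                           # default: no stretch
            nxt[0], nxt[1] = 0b000, 0
    elif state == 0b111:
        nxt[1], nxt[3], nxt[4] = 0, 0, 0
    elif state == 0b011 and is_via:
        nxt[3] = 1
    if state & 1:
        nxt[2] = via_clk ^ 1
    return tuple(nxt)

def bus_cycle(step, s1=0, s2=0, s3=0, is_via=0, via_clk=0):
    """Register values after each FCLK edge of one bus cycle, starting in state 000."""
    regs = (0b000, 0, via_clk, 0, 0)
    trace = []
    while True:
        regs = step(regs, s1, s2, s3, is_via)
        trace.append(regs)
        if regs[0] == 0b000:
            return trace

if __name__ == "__main__":
    # Compare both versions over every combination of inputs and initial VIA_CLK.
    for inputs in product((0, 1), repeat=5):
        assert bus_cycle(step_if_else, *inputs) == bus_cycle(step_case, *inputs)
    print(len(bus_cycle(step_if_else, s2=1)), "FCLK periods for a 2-cycle stretch")
```

Under this model an unstretched bus cycle lasts 2 FCLK periods, and the 1-, 2- and 3-cycle stretches last 4, 6 and 8. A VIA access lasts 6 or 8 periods depending on the phase of VIA_CLK, and VIA_SEL only ever rises on an edge where VIA_CLK goes low. A software model says nothing about propagation delays or setup times, so it is no substitute for simulating the real Verilog.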

BigDumbDinosaur · Post by **BigDumbDinosaur** » Tue Jan 09, 2024 9:22 pm

POC V1.3 has achieved 110 days of uptime, as I monkey around with a new-fangled S-record loader that will include some features I originally omitted, e.g., being able to process S2 and S8 records. Those normally wouldn't be used with a 65C02, but are useful with the 65C816, since a program can be assembled to run in extended RAM.

Just for reference, attached is the assembly listing for V1.3’s firmware. Someone might find something useful in it.

Attachment: firmware_2_7_0.txt (POC V1.3’s Firmware, V2.7.0; 682.81 KiB)

Proxy · Post by **Proxy** » Wed Jan 10, 2024 7:12 am

Personally I really like the PGZ format for loading programs over serial, as it's very easy to write a loader for compared to a lot of other formats (at least for me).

When I set up the gcc toolchain for my 68000 SBC a month or so ago, it obviously didn't have PGZ as an output option. But I didn't want to write an SREC loader in 68k assembly, so I did the only logical thing: I learned the ELF file format and wrote a desktop program to convert an ELF executable into a PGZ file.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jan 10, 2024 8:17 am

Proxy wrote:

Personally I really like the PGZ format for loading programs over serial, as it's very easy to write a loader for compared to a lot of other formats (at least for me).

After reading about PGZ (and PGX), I’m unimpressed. Neither has any apparent error-detection provisions, and both appear to be one-off formats. The S-record format has record checksumming, plus additional mechanisms for error detection, and is a de facto standard. Also, like Intel hex and other similar formats, S-record is pure ASCII, making it suitable for transmission through a medium that wouldn’t gracefully handle binary data. I’m not seeing any compelling reason to monkey with PGZ.
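
For anyone unfamiliar with the record checksumming mentioned above, here is a minimal Python sketch of validating a single S-record. The names `parse_srecord` and `ADDRESS_BYTES` are made up for this illustration and come from no particular loader. The checksum rule itself is the standard one: the ones' complement of the low byte of the sum of the count, address and data bytes.

```python
# Address field width in bytes for each record type
# (S2 and S8 carry 24-bit addresses).
ADDRESS_BYTES = {"0": 2, "1": 2, "2": 3, "3": 4,
                 "5": 2, "6": 3, "7": 4, "8": 3, "9": 2}

def parse_srecord(line):
    """Validate one S-record; return (type, address, data) or raise ValueError."""
    line = line.strip()
    if len(line) < 10 or line[0] != "S" or line[1] not in ADDRESS_BYTES:
        raise ValueError("not a recognised S-record")
    body = bytes.fromhex(line[2:])          # raises ValueError on bad hex
    count, payload, checksum = body[0], body[1:-1], body[-1]
    if count != len(body) - 1:
        raise ValueError("byte count does not match record length")
    if (~(count + sum(payload)) & 0xFF) != checksum:
        raise ValueError("checksum mismatch")
    width = ADDRESS_BYTES[line[1]]
    return line[1], int.from_bytes(payload[:width], "big"), payload[width:]

if __name__ == "__main__":
    rec_type, address, data = parse_srecord("S205010000AA4F")
    print(rec_type, hex(address), data.hex())
```

Because every record carries its own count and checksum, a loader written this way can reject a damaged record as soon as it arrives rather than discovering the problem after the whole file has been loaded.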

BigEd · Post by **BigEd** » Wed Jan 10, 2024 8:23 am

Elsewhere, BDD, I'm pretty sure you've said short-range serial connections are reliable. So we shouldn't expect any corruption on the wire. Having said that, small systems like ours might often initially have problems with flow control or overrun, so a final checksum might be a good idea.

Proxy · Post by **Proxy** » Wed Jan 10, 2024 10:19 am

I've never really had a need or seen a reason for exact line-by-line error checking in a hobbyist environment where wire runs are only a few meters at most and any data corruption can be solved by just resetting the system and/or resending the file.
And I actually like that it's not ASCII; it makes the file size smaller and loading a bit faster, as you don't need any kind of conversion. And what do you mean by serial not handling raw binary well? I've never had any issues with that.

Though I guess a complete checksum at the end (or start) of the file would be a good addition. I have created a slightly modified version of PGZ called PGW, which adds an endianness option, so I could just add a 32-bit checksum in the header and make that part of the format.

drogon · Post by **drogon** » Wed Jan 10, 2024 10:24 am

Sorry - can't resist adding this at this point: https://xkcd.com/927/

-Gordon

Proxy · Post by **Proxy** » Wed Jan 10, 2024 11:00 am

Ah good old XKCD.
But the thing is I don't really see these as "competing" standards, just as options. So anyone can choose whatever format fits their requirements and use case.
For example, I don't need error checking and just wanted something easy to parse and fast, so PGZ turned out to be the best option for me. I didn't mean to imply that it's perfect for every case.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jan 10, 2024 7:51 pm

BigEd wrote:

Elsewhere, BDD, I'm pretty sure you've said short-range serial connections are reliable. So we shouldn't expect any corruption on the wire.

That is true, assuming the hardware driving the link is properly performing. I know from past experience that high-quality hardware can reliably drive a TIA-232 link at 115.2Kbps over 50 meters, assuming the use of CAT5 UTP and the link being kept away from sources of strong, fluctuating magnetism (e.g., large AC motors), EMI and ESI. At the distances we use in our hobby, I’d expect very little signal degradation.

Quote:

Having said that, small systems like ours might often initially have problems with flow control or overrun, so a final checksum might be a good idea.

Agreed. Overrun is always a risk, especially if using the 65C51 or similar. I can deliberately induce overrun in my POC units during data transfer from my development server, despite the 28L92’s 16-deep FIFOs. All that is needed is a slow Ø2 clock.

Without at least a basic checksum process, how would it be known if a datum got clobbered?

Proxy wrote:

I've never really had a need or seen a reason for exact line-by-line error checking in a hobbyist environment where wire runs are only a few meters at most and any data corruption can be solved by just resetting the system and/or resending the file.

As said above, the length of a TIA-232 link in a hobby environment is seldom a factor in reliable transmission. Problems with the receiving station are more likely to result in errors, as Ed notes. I personally would find it really annoying to have to start all over just because of a one-byte error.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Jul 05, 2024 8:29 am

Many pages ago, I described the SCSI host adapter I built to attach mass storage to my POC units. First a little history...

The core of the original host adapter was the 53C94 “advanced SCSI controller,” an NCR device that was second-sourced by AMD and others, and was used in the host adapters of a number of different systems. The C94 is a member of a family of NCR SASI/SCSI controllers that dates back to the early 1980s, well before SCSI had been defined as a formal standard. The C94 has considerable intelligence, offloading virtually all aspects of the bus protocol from the host system. That intelligence makes it possible to support SCSI without the need for the driver having to know much about the bus protocol.

A project of mine from the early 1990s used the C94’s ancestor, the 53C90A, to implement SCSI on a unit powered by a 65C02 running at 8 MHz. The C90A supported the then-nascent ANSI SCSI-2 standard (1989), as well as the older ANSI SCSI-1 standard (1986), and the still-older SASI bus developed by Shugart Associates in 1979 and first commercially supported by NCR in 1981. Using the C90A and some reasonably efficient programming, I was able to achieve a respectable transfer rate of around 350 KB/second with this setup, about 10 percent of the maximum raw, asynchronous SCSI bus speed supported by the C90A.

When I decided to add SCSI to POC V1, I initially tried to source the 53C90A to build the host adapter. An NOS liquidator that I had contacted advised me that they didn’t have the C90A, but did have the NCR 53C94, which can be described as a 53C90A on steroids. After studying the C94’s data sheet, I decided to design my host adapter around it.

The C94 is able to support programmed I/O (PIO) or direct memory access (DMA) transfers between the SCSI bus and the host system. In studying the C94’s data sheet, it soon became patent that use of PIO is to be avoided, which was subsequently proved in reality during early experimentation. When operated in PIO mode, the C94 will generate an interrupt for each byte processed during any information transfer bus phase. Case in point: a typical hard disk block is 512 bytes. Hence 512 interrupts will be generated to read or write one block, plus other interrupts will be generated as the SCSI bus changes phases and the host adapter and target device exchange status information.

Given the 15 clock cycles of overhead associated with the MPU responding to an interrupt, performance will suck. 7,680 clock cycles (512 × 15) will be consumed in interrupt overhead alone to transfer one block, and then there will be the 20-or-so clock cycles needed to actually fetch and store each byte. I won’t even mention the preamble and postamble parts of the interrupt service routine that will be executed 512 times per block read or written.
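
The arithmetic behind those figures can be laid out as a short Python sketch. The per-interrupt figure of 15 is taken here as clock cycles, since 512 × 15 gives the 7,680 quoted above, and the 20 cycles per byte is the rough "20-or-so" estimate from the text. The function name `pio_cycles_per_block` is invented for the illustration, and the ISR preamble and postamble are left out, as they are in the text.

```python
BLOCK_BYTES = 512                # typical hard disk block
IRQ_OVERHEAD_CYCLES = 15         # interrupt entry/exit cost per interrupt
TRANSFER_CYCLES_PER_BYTE = 20    # rough cost to fetch and store one byte

def pio_cycles_per_block(block_bytes=BLOCK_BYTES,
                         irq_overhead=IRQ_OVERHEAD_CYCLES,
                         transfer_per_byte=TRANSFER_CYCLES_PER_BYTE):
    """Return (interrupt overhead, transfer work, total) in clock cycles."""
    overhead = block_bytes * irq_overhead      # one interrupt per byte in PIO mode
    transfer = block_bytes * transfer_per_byte
    return overhead, transfer, overhead + transfer

if __name__ == "__main__":
    overhead, transfer, total = pio_cycles_per_block()
    print(f"overhead {overhead}, transfer {transfer}, total {total} cycles")
```

With these inputs, interrupt overhead alone accounts for well over a third of the cycles spent on each block, before any ISR housekeeping is counted.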

On the other hand, when using DMA with the C94, the device will interrupt only when all bytes have been transferred or a bus phase change has occurred, reducing a typical SCSI transaction to no more than four or five interrupts. Almost all processing time will be expended in actual data transfer, which can be done quite efficiently with the right code.

The problem is in implementing DMA with the C94. I’m not aware of the existence of a DMA controller that is adaptable to a 65C816 system running in the 16+ MHz range. It might be possible to make a CPLD act as a DMA controller, but my CPLD programming skills aren’t at that level.

I had considered the possibility of rigging up a 65C02 as a sort-of DMA controller, taking advantage of the SOB input to quickly recognize when the C94 is ready to accept or provide some data. I also conjured another method using IRQB rigged up in a way that would trigger the byte transfer without sustaining the overhead of processing an interrupt (the SEI - WAI trick). I even thought of using an 816 as the DMA “controller,” along with some clock trickery for pacing transfers. After pondering how two MPUs would co-exist on the same bus (not simple) and doing myriad cycle counts to estimate performance, I concluded it would be a lot of work for, at best, a 2-to-1 speedup.

That said, I was bound and determined to figure out how to avoid doing PIO, which led me to conclude I could arrange the host adapter so the 65C816 could do “pretend DMA” in software. Performance would be nowhere near what could be achieved with a real DMA controller, but would be a lot faster than managing the IRQ storm that PIO operation would produce.

The C94 has 16 registers in total, which are accessed through conventional decoding in the POC unit’s I/O block. Like most I/O chips, the C94 has a chip select (/CS) input, which along with its A0-A3 inputs and the status of the /RD and /WD inputs, exposes a register to the data bus. Most registers are used to configure the device and report status. A single register, the “FIFO,” acts as an intermediary between the host system and the SCSI bus. Succinctly, during an information transfer bus phase, incoming data are read from the FIFO and outgoing data are written to the FIFO. In general, the host system doesn’t need to be concerned with how the C94 communicates with the SCSI bus, as bus access handshaking is completely automatic.

The interesting thing about the FIFO is it is accessible in one of two ways. If /CS is asserted and the bit pattern %0010 is placed on A3-A0, the FIFO can be read or written, same as other registers in the C94. This is the method used when PIO is employed to read from or write to a SCSI device. As earlier noted, each such access will cause the C94 to interrupt.

The C94 also has a DMA port that directly communicates with the FIFO. A separate C94 select input, /DACK, will expose the FIFO to the data bus—/CS is not used at all and in fact, must never be asserted when /DACK is asserted (experimentation with doing so caused major chaos). /DACK is used in conjunction with a C94 output called DREQ to form a handshaking setup that normally would be wired to a DMA controller.

During a DMA read operation, the C94 will assert DREQ when at least one byte is available in the FIFO. The DMA controller would respond by asserting /DACK to expose the FIFO to the data bus, fetching the waiting byte, releasing /DACK, and storing the byte in RAM. If more data is available in the FIFO, the C94 will again assert DREQ and the sequence can be repeated. As previously noted, the C94 takes care of synchronizing SCSI bus activity with data outflow from the FIFO—the DMA controller only has to fetch from the FIFO when DREQ is asserted.

During a DMA write operation, the C94 will assert DREQ if it can accept a byte—I should note the FIFO is 16-deep. The DMA controller would respond by fetching a byte from RAM, asserting /DACK to expose the FIFO, writing the byte into the FIFO and then releasing /DACK. If the C94 is able to accept more data, it will again assert DREQ. As with a read operation, SCSI bus synchronization is handled by the C94, and the DMA controller need not be concerned with it.

As part of the DMA setup, a register pair in the C94, the DMA transfer counter, must be loaded with a count of the bytes that are to be transferred, usually 512 per hard disk block. This counter is decremented each time /DACK is toggled. When the count reaches zero, the C94 will generate an interrupt. In a system with an actual DMA controller, that interrupt will be the signal to the microprocessor that the transfer has completed. In my “pretend DMA” setup, that interrupt is used in the driver software to route foreground execution, which design feature inadvertently created an obscure timing issue about which I will shortly bloviate.

In my initial work to adapt SCSI to POC V1.0, I started by using PIO to verify that my basic chip-select logic and I/O routines were functional. During that phase of experimentation, it became painfully apparent how slow PIO operation is.

With that out of the way, the next step was to get “pretend DMA” working. This entailed assigning three different decoded I/O pages to the host adapter, one for general register access, a second page to directly access the DMA port by asserting /DACK, and a third one to fetch status information, which initially was the state of DREQ.

In POC V1.3, the current iteration of the POC series, the C94’s /CS will be asserted if there is an access to the $00C4xx I/O page, which gives register access according to the bit pattern on A0-A3. An access to any address in the $00C5xx I/O page will assert /DACK, hence exposing the FIFO to the data bus. With either access method, the operation to be performed is determined by the state of /RD and /WD, both signals being qualified by Ø2 in usual 6502 fashion.

Generating host adapter status is a little more involved. Part of the host adapter circuitry is a 74ACT541 bus driver, which is wired so a read access of any address in the $00C6xx I/O page will connect the ACT541 to the data bus. Bit 7 reflects the state of DREQ, and in the second generation host adapter design, which this new host adapter design replaces, bits 2-0 indicate the host adapter’s SCSI bus ID, which is jumper-settable (the bus ID is read during POST and used to configure the C94’s notion of its bus ID). Bits 3-6 are unused and always return 0. By having DREQ appear as bit 7, it is easily tested with the BIT instruction.
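
The status byte layout described above can be summarised in a few lines of Python. This is only an illustrative sketch of the bit assignments; `decode_status`, `DREQ_MASK` and `BUS_ID_MASK` are names invented here and do not appear in the firmware.

```python
DREQ_MASK = 0b1000_0000      # bit 7: state of the C94's DREQ output
BUS_ID_MASK = 0b0000_0111    # bits 2-0: jumper-selected SCSI bus ID

def decode_status(value):
    """Split a byte read from the status page into (dreq_asserted, bus_id)."""
    return bool(value & DREQ_MASK), value & BUS_ID_MASK

if __name__ == "__main__":
    print(decode_status(0x85))   # DREQ asserted, bus ID 5
```

Placing DREQ in bit 7 is what lets the 65C816 test it with BIT followed by BPL or BMI, with no masking step.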

With this arrangement, the code to produce a “pretend DMA” transfer is quite simple, once preliminary operations have configured the C94 as required and the target device (disk, tape, etc.) has been selected and has responded. For example, writing to the SCSI bus is as follows (not the exact code, but functionally identical):


dmawrit  sep #%00100000        ;8-bit accumulator
         rep #%00010000        ;16-bit index
         ldy !#0               ;storage index
;
.loop    bit c94_stat          ;can data be accepted?
         bpl .loop             ;no
;
         lda [ptr01],y         ;fetch from RAM &...
         sta c94_dma           ;store to C94 DMA port
         iny
         bra .loop             ;next

Using BIT on C94_STAT reports the status of DREQ, which will be low if the C94 is not ready to accept a byte, thus implementing the required handshaking. As soon as DREQ goes true, a byte can be written to the DMA port. An access to C94_DMA will assert /DACK, which as earlier explained, will expose the FIFO to the data bus. Writing to the FIFO will decrement the DMA transfer counter.

Although the above appears to be an infinite loop, it will be broken shortly after a write to C94_DMA if the write causes the target device to change the bus phase, or if the DMA transfer count reaches zero.¹ When the count hits zero, the C94 will interrupt and report a “DMA transfer completed” status. In response, the SCSI portion of the interrupt service routine (ISR) will modify the address pushed to the stack by the 65C816 when it took the interrupt (using stack-relative addressing), which will result in the foreground part of the SCSI driver being redirected away from the transfer loop.

Reading from the SCSI bus follows a similar pattern:


dmaread  sep #%00100000        ;8-bit accumulator
         rep #%00010000        ;16-bit index
         ldy !#0               ;storage index
;
.loop    bit c94_stat          ;can data be gotten?
         bpl .loop             ;no, wait
;
         lda c94_dma           ;fetch from C94 DMA port &...
         sta [ptr01],y         ;store to RAM
         iny
         bra .loop             ;next

In this case, the DMA transfer counter will decrement with each fetch from C94_DMA and as with the previous write loop, a “DMA transfer completed” interrupt will occur when the transfer count reaches zero. Again, said interrupt will result in the foreground being redirected to another part of the driver, breaking the loop.

The described arrangement worked quite well with POC V1.1 running at 12.5 MHz and could consistently maintain a transfer rate of around 650 KB/second with multi-block transfers in either direction. Shortly after I had gotten this refined, I snagged a supply of NCR 53CF94 “fast SCSI controllers,” the CF94 being a souped-up version of the 53C94 and able to support asynchronous bus speeds of 5 MB/second, versus 3.5 MB/second maximum with the C94.

Multi-block transfer speeds increased to around 710 KB/second once I had modified the SCSI driver to take advantage of the CF94. During experimentation with the driver, I noted that the CF94 almost never deasserted DREQ once a transfer commenced, which likely accounted for the improved throughput.

However, using the CF94 also introduced some “operational anomalies” that occasionally messed up a read transfer...but never a write. I couldn’t see any obvious reason for it nor could I determine the precise nature of the error, but figured it was likely a system bus timing issue, or perhaps a consequence of not having I/O wait-stating. With no logic analyzer handy at the time and realizing that debugging with a logic probe and scope was not likely to shed any light on the problem, I returned to using the C94 and set the problem aside for another day.

Meanwhile, POC V1.3 had come off the drawing board and was stable at 16 MHz. I decided to design a host adapter to go with V1.3 (different mechanical layout than POC V1.1) and mostly copied the existing circuit. While I was at it, I added the jumper-selectable SCSI bus ID function to the new host adapter, rather than hard-coding the bus ID into the firmware. Since there was some room left over on the PCB, I incorporated a “SCSI bus active” indicator, basically a red LED driven by an inverter connected to one of the bus control lines—the LED would be lit any time the bus was in use. The new host adapter worked on the first try, and was runnable with an edited copy of V1.1’s SCSI driver (I subsequently rewrote the driver, as the original had more patches than a hobo’s trousers, and was a bit of a mess).

With the new hardware running in a stable fashion, I decided to put the CF94 back into use...only to again be confronted by “operational anomalies.” By now, I had a 32-channel logic analyzer at my disposal...it was interesting watching the SCSI bus gyrations...

Anyhow, what was happening was every so often a read transfer (the above DMAREAD function) would drop the last byte. There was indeed a timing booby-trap lurking within, which had been present all along, but would not cause trouble with the C94, even when running on the slower POC V1.1. It took the higher performance of POC V1.3 and the CF94 to expose it.

Recall that when the DMA transfer count reaches zero, the C(F)94 will interrupt. In a read transaction, there is a narrow window of opportunity for the interrupt to “sneak in” just before the byte that was fetched gets stored in RAM. If that interrupt does sneak in, the final write will not occur, resulting in a dropped byte. Sneaky IRQs weren’t happening with the C94 because the time span from when the DMA counter reached zero to when the device interrupted was longer than the time required for the 65C816 to fetch the STA [PTR01],Y instruction’s opcode, which event would postpone any IRQ response. So the last byte would always get written before the IRQ broke the transfer loop.

As it turned out, the CF94 reacts more quickly than the C94 to events and thus can sneak in the interrupt before the 65C816 can fetch the STA [PTR01],Y instruction’s opcode. In fact, by increasing the CF94’s clock from 20 MHz to 25 MHz (the CF94 can support a 40 MHz clock), I could make the error consistently occur and, using the logic analyzer, could see that the CF94 asserted its /INT output mere nanoseconds after /DACK was deasserted following a fetch from the FIFO. The lag between /DACK being deasserted and /INT being asserted was small enough to cause the IRQ to hit very early in the Ø2 low phase, within the timing window where the 65C816 samples IRQB.
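
The race can be pictured with a toy Python model of the read loop. This is a deliberately simplified sketch, not a cycle-accurate simulation: `pretend_dma_read` is a hypothetical name, and the flag `irq_beats_store` stands in for whether /INT reaches the 65C816 before it fetches the STA opcode.

```python
def pretend_dma_read(fifo, irq_beats_store):
    """Model the DMAREAD loop; return the bytes that actually reach RAM."""
    ram = []
    remaining = len(fifo)                 # DMA transfer counter
    for byte in fifo:
        remaining -= 1                    # LDA c94_dma toggles /DACK
        irq_pending = remaining == 0      # count hit zero: C(F)94 interrupts
        if irq_pending and irq_beats_store:
            break                         # ISR redirects foreground before STA
        ram.append(byte)                  # STA [ptr01],y
        if irq_pending:
            break                         # ISR redirects foreground after STA
    return bytes(ram)

if __name__ == "__main__":
    block = bytes(range(8))
    print(len(pretend_dma_read(block, irq_beats_store=False)))  # C94-like timing
    print(len(pretend_dma_read(block, irq_beats_store=True)))   # CF94-like timing
```

In the model, the only difference between a clean transfer and a dropped final byte is which side of the STA opcode fetch the interrupt lands on, which matches the behaviour seen on the logic analyzer.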

In an effort to deal with this problem, I modified the read function as follows:


dmaread  sep #%00100000        ;8-bit accumulator
         rep #%00010000        ;16-bit index
         ldy !#0               ;storage index
;
.loop    bit c94_stat          ;can data be gotten?
         bpl .loop             ;no, wait
;
         sei                   ;<-- added: hold off IRQs
         lda c94_dma           ;fetch from C94 DMA port &...
         sta [ptr01],y         ;store to RAM
         cli                   ;<-- added: re-enable IRQs
         iny
         bra .loop             ;next

While the above modification addressed the timing problem, it did so at the expense of performance, as each loop iteration uses four additional Ø2 cycles. Four cycles doesn’t sound like much, until you consider that in a one-block disk read operation, that amounts to 2048 extra cycles. Adding to the fun, the POC V1.3 SCSI driver is able to read or write up to 64KB in a single transaction. A 64KB read will consume more than 260,000 clock cycles executing all those SEIs and CLIs in the loop. Ouch!
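As a quick check of the overhead figures above, here is a minimal Python sketch. It assumes the standard 65C816 timing of two cycles each for SEI and CLI; the 16 MHz Ø2 value used for the time estimate is the V1.3 clock rate mentioned elsewhere in this thread, and the function names are made up for illustration.

```python
# Overhead of the SEI/CLI pair in the pretend-DMA read loop.
SEI_CYCLES = 2  # 65C816 SEI is a 2-cycle instruction
CLI_CYCLES = 2  # 65C816 CLI is a 2-cycle instruction
EXTRA_PER_BYTE = SEI_CYCLES + CLI_CYCLES


def extra_cycles(byte_count):
    """Extra clock cycles spent on SEI/CLI for a transfer of byte_count bytes."""
    return byte_count * EXTRA_PER_BYTE


def extra_time_ms(byte_count, phi2_mhz):
    """Same overhead expressed in milliseconds at the given clock rate (MHz)."""
    return extra_cycles(byte_count) / phi2_mhz / 1000.0


one_block = extra_cycles(512)        # one 512-byte disk block
max_read = extra_cycles(64 * 1024)   # largest single transaction (64KB)

print(one_block, max_read, extra_time_ms(64 * 1024, 16.0))
```

At 16 MHz the 64KB case works out to a little over 16 milliseconds spent doing nothing but masking and unmasking interrupts.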

Although SCSI performance as it sits right now is pretty good, all that wasted processing time relentlessly bugs me, compelling me to find a way to reclaim those clock cycles. An approach would be to eliminate SEI and CLI from the loop and somehow postpone a CF94 “DMA transfer complete” interrupt long enough to guarantee that the final byte will be safely stored before the read loop is broken. As my thoughts evolved on this, it became clear that there should be some controlled propagation delay in the CF94’s /INT circuit, but only during a high-to-low transition. The low-to-high transition, on the other hand, should be as expeditious as possible to avoid the risk of a spurious interrupt.

For some reason, I had a hard time getting my head around the circuit details—a “senior moment,” as it were—and thus sought some suggestions, which primed the pump for me and led to a solution.

Figure: IRQ Delay Circuit

The above circuit delays the propagation of the CF94’s /INT signal by about 5 microseconds (assuming my math was correct in selecting the values for C1 and R2), which gives the 65C816 plenty of time to write the final byte to RAM during a read transaction, even at 1 MHz. The circuit will rapidly clear once /INT goes high, which should avoid a dreaded spurious IRQ.

I have designed a revised host adapter circuit and PCB layout to implement this change. The PCBs are on the way. More on this when I have built the host adapter.

scsi_hba_schematic.pdf: New-Design SCSI Host Adapter Schematic

Figure: New-Design SCSI Host Adapter PCB

————————————————————
¹Under normal circumstances, a change-in-bus-phase IRQ will coincide with a DMA-counter-reaching-zero IRQ. A bus-phase-change IRQ has priority, as it could occur if the target device experiences an error that prevents the completion of the transaction.
————————————————————
Edit: Fixed some typos.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Jul 19, 2024 6:26 am

I have received the PCBs for the new host adapter and have commenced the build. It should be ready to go in a day or two.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jul 24, 2024 5:35 am

SCSI host adapter is built and works. In the picture below, the previous-design host adapter is on the left and the new-and-improved™ version is on the right. As with the previous-design adapter, the new one is clocked at 25 MHz and is theoretically capable of supporting a maximum SCSI bus speed of 5 MB/second in synchronous mode (the 53CF94 can support a synchronous bus speed of 10 MB/second if clocked at 40 MHz). In my application, the bus is run asynchronously, with a maximum speed of 3.5 MB/second, which is well beyond the speed at which the 65C816 powering POC V1.3 can copy bytes to/from the host adapter.

The next step will be to hook up the logic analyzer and observe the behavior of the host adapter’s delayed-IRQ circuit. Assuming it is working as designed, I will then modify the SCSI driver code to take advantage of the new-and-improved™ hardware and, it is hoped, increase the transfer rate.

Figure: Old & New Host Adapters, New on the Right

GlennSmith · Post by **GlennSmith** » Wed Jul 24, 2024 3:04 pm

Hey, BDD - eagerly awaiting the results!
Thanks for the write-up of your SCSI system.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sat Aug 03, 2024 7:20 am

I finally got around to hooking up the logic analyzer to the new SCSI host adapter to observe the behavior of the latter’s interrupt circuit. Recall from above that this host adapter is designed to delay its IRQ so as to give the MPU sufficient time to complete the last store instruction during a read operation. Below is the annotated capture from the logic analyzer showing the behavior of the IRQ delay feature:

Figure: Delayed IRQ Capture

The delay isn’t as long as I had calculated when I was designing the host adapter because I made the assumption that two R-C time-constants would be required to “trip” the Schmitt inverter that drives the host adapter’s /IRQ output. It appears only one time-constant is needed to do that. Even so, the 2.72 µsec delay is more than sufficient to accomplish what is needed, even if the Ø2 clock rate is only 1 MHz (which it never is).
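The relationship between time constants and the Schmitt trigger threshold can be sketched with the usual single-pole R-C formula. The Python below is only an illustration: the post does not give the values of R2 and C1 or the inverter's actual threshold, so the time constant used here (2.72 µs) is an assumption inferred from the statement that the measured delay corresponds to about one time constant, and the function name is hypothetical.

```python
import math


def time_to_fraction(tau_us, fraction_of_swing):
    """Time (in microseconds) for a single-pole R-C node to complete the
    given fraction of its full voltage swing: t = -tau * ln(1 - fraction)."""
    if not 0.0 < fraction_of_swing < 1.0:
        raise ValueError("fraction_of_swing must be between 0 and 1")
    return -tau_us * math.log(1.0 - fraction_of_swing)


TAU_US = 2.72  # ASSUMED: inferred from the measured delay, not from R2 and C1

# One time constant corresponds to about 63.2% of the swing,
# two time constants to about 86.5%.
one_tau_fraction = 1.0 - math.exp(-1.0)
two_tau_fraction = 1.0 - math.exp(-2.0)

measured_like = time_to_fraction(TAU_US, one_tau_fraction)  # about 2.72 us
designed_like = time_to_fraction(TAU_US, two_tau_fraction)  # about 5.44 us

print(round(measured_like, 2), round(designed_like, 2))
```

Under that assumption, a design estimate based on two time constants lands near 5.4 µs, which is in line with the roughly 5 µs figure quoted when the circuit was designed.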

Incidentally, the measured propagation delay through the Schmitt inverter is 8 ns, roughly midway in the device’s specified range.

I have some captures I will post later on that show other host adapter activity that was of interest to me.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Aug 14, 2024 7:38 pm

While I had the logic analyzer connected to POC V1.3’s new-and-improved™ SCSI host adapter, I decided to capture what was going on during a SCSI transaction, as will be seen below.

The version of Supermon 816 that runs in V1.3’s firmware includes the ability to issue some basic commands to the SCSI subsystem. For example, it’s possible to low-level format a disk, read or write data from a block device’s medium or get status information from a device, the last by issuing a “request sense” command to the device in question.

As implemented in POC V1.3’s firmware, “request sense” returns 18 bytes of status information to the host, which makes the command useful for initiating a SCSI transaction for observational purposes. Accordingly, I ran a “request sense” transaction while the logic analyzer was hooked up to watch key host adapter signals and what they were doing as the transaction progressed.

Before getting to the logic analyzer capture details, I will give an overview of the SCSI subsystem.

POC V1.3’s SCSI subsystem is an eight-bit, parallel bus topology, consisting of a host bus adapter (HBA) mounted on V1.3, several SCSI devices and an interconnecting cable, the latter of which is Thevenin-terminated at both ends. Being an eight-bit bus, the maximum number of supported devices is seven, not including the HBA. The SCSI subsystem mostly implements the ANSI SCSI-2 1994 standard, with reselection, synchronous data transfer, and tagged operation queuing not supported due to limitations in the POC V1.3 hardware.

The heart of the HBA is the 53CF94 (CF94) “fast” SCSI controller, which is wired directly to and thus drives the SCSI bus. The CF94 has sufficient intelligence to autonomously sequence the complex SCSI bus protocol in response to simple commands, as well as recognize changing bus conditions in real time. Working with the CF94 is a 74ACT541 (541) bus driver¹ (U2 in the HBA schematic) that is used to provide real-time, HBA status information as a SCSI transaction is processed. The 541 looks like read-only hardware to the MPU when selected.

Traditionally, devices connected to the SCSI bus are described in terms of “initiators” and “targets.” An initiator begins a SCSI transaction by attempting to establish communication with a target. Once a target has responded, the bus will be sequenced in various ways to carry out a transaction, setting up what are referred to as bus “phases.” ANSI SCSI-2 1994 defines eight such phases, of which seven are used in this implementation:

Bus-free. Bus-free is the condition in which no activity is taking place. All SCSI devices are in the high-impedance state, which means the bus is floating.² SCSI transactions always commence from the bus-free phase. Once the bus-free state has been established, the HBA, commanded by the SCSI driver, can initiate a transaction.
Arbitration. As any device on a SCSI bus can theoretically act as an initiator, arbitration is used to decide which device will get control. The device that “wins” arbitration will take control and initiate the next bus phase, which will be “selection.” All other devices will become quiescent.
Selection. During this phase, which immediately follows arbitration, the initiator will attempt to connect to the target. The target must respond to selection within 250 milliseconds; if it does not, the initiator will release the bus and report a fatal error. If selection is successful, the target will take control of the bus and will change the bus to the command phase (described next).
Command. During this phase, the initiator will send a data structure referred to as a “command descriptor block” (CDB) to the target. The CDB tells the target what operation is being requested, and includes any parameters needed to carry out the operation. The target will analyze the CDB content and then change the bus to one of the “information transfer phases,” which are described next. The particular phase chosen will depend on whether the target was able to decipher the CDB and if so, whether it can execute the command.
Data. During this phase, data will be transferred between initiator and target, the direction of transfer determined by whether the command is a read operation (target to initiator) or a write operation (initiator to target). The terms “data-in” and “data-out” refer to this phase, with data-in being a read operation. An example of a data-in operation would be fetching a block of data from a disk’s medium.
Status. During this phase, the target will send a status byte to the initiator in response to the completion status of the most recent command. The two most-common statuses sent are “okay” and “check condition,” the latter indicating the target encountered a problem during command execution.
Message-in. During this phase, the target will send a message to the initiator. Messages are used for bus-management and error-recovery purposes. For example, after processing a command, a target may send a “command complete” message.
Message-out. During this phase, the initiator will send a message to the target. This bus phase is supported by the SCSI driver, but is not currently in use.

Information transfer phases can occur in any order, and it is possible that the bus-free phase could occur without warning. For this reason, the SCSI driver software is organized around the bus phase concept. With each bus phase change, the CF94 will generate an IRQ that will be processed by the background part of the driver. That processing will clear the IRQ, collect status information and return it to the foreground. The foreground will then analyze the status information and dispatch execution as required. Each information transfer bus phase has a dedicated module that understands what is required to process that phase. Transfers between the HBA and RAM are carried out using the “pretend DMA” procedure I described in an earlier post.
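The phase-driven organization described above can be pictured as a dispatch table keyed by bus phase. The Python below is a loose sketch of that idea only, not the actual driver (which is 65C816 assembly): the phase names come from the list above, while the handler functions, the log strings and `run_transaction` are all hypothetical stand-ins.

```python
from enum import Enum, auto


class Phase(Enum):
    DATA_IN = auto()
    DATA_OUT = auto()
    STATUS = auto()
    MESSAGE_IN = auto()
    MESSAGE_OUT = auto()
    BUS_FREE = auto()


# One hypothetical "module" per information transfer phase.
def handle_data_in(log):
    log.append("data-in: copy bytes from HBA to RAM")


def handle_data_out(log):
    log.append("data-out: copy bytes from RAM to HBA")


def handle_status(log):
    log.append("status: read status byte")


def handle_message_in(log):
    log.append("message-in: read message byte")


def handle_message_out(log):
    log.append("message-out: send message byte")


HANDLERS = {
    Phase.DATA_IN: handle_data_in,
    Phase.DATA_OUT: handle_data_out,
    Phase.STATUS: handle_status,
    Phase.MESSAGE_IN: handle_message_in,
    Phase.MESSAGE_OUT: handle_message_out,
}


def run_transaction(phase_changes):
    """Foreground loop: each item stands for the status handed back by the
    interrupt handler after a bus phase change."""
    log = []
    for phase in phase_changes:
        if phase is Phase.BUS_FREE:  # may arrive at any time, so check first
            log.append("bus-free: transaction over")
            break
        HANDLERS[phase](log)
    return log


trace = run_transaction(
    [Phase.DATA_IN, Phase.STATUS, Phase.MESSAGE_IN, Phase.BUS_FREE]
)
print(trace)
```

Because the handlers are looked up by phase rather than called in a fixed sequence, the loop does not care in which order the target presents the phases, which is the property the paragraph above relies on.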

What follows are several captures of HBA signals, the first of driver operations as a “request sense” command is executed.

Figure: “Request Sense” Sequence

First, a signal explanation:

53CF94 /INT. This signal is the open-collector interrupt output of the CF94 bus controller.
Inverted /INT. This is the inversion of 53CF94 /INT. It appears as bit 6 when the HBA’s status is fetched by the MPU.
65C816 /IRQ. This is the interrupt signal actually seen by POC V1.3’s hardware. As described in my previous post, /IRQ is made to lag 53CF94 /INT to avoid a potential timing race condition in the SCSI driver during read operations.
DMA Request (DREQ). The CF94 has a 16-deep FIFO that acts as a bi-directional data path between the SCSI bus and the host system. When a direct memory access (DMA) read operation is in progress, DREQ will be asserted any time there is at least one datum in the FIFO. When a DMA write operation is in progress, DREQ will be asserted if there is room in the FIFO for another datum. DREQ appears as bit 7 when the HBA’s status is fetched by the MPU.
DMA Acknowledge (/DACK). During a DMA read or write operation, /DACK will be asserted when the HBA’s DMA port is accessed by the MPU. The result is the aforementioned FIFO will be connected to the data bus and the MPU can read or write it as required. /DACK is toggled for each byte that is to be read or written. Toggling /DACK will also decrement the DMA transfer counter register in the CF94, which is programmed with the number of bytes to be transferred during a read or write operation. The coordinated usage of DREQ and /DACK handshakes all DMA activity.
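For reference, the status-byte layout mentioned in this post (DREQ on bit 7, inverted /INT on bit 6, and the bus ID jumpers on bits 0-2 per the footnote) can be decoded as in the Python sketch below. The function and key names are invented for illustration, and bits 3-5 are ignored because the post does not say what they carry.

```python
DREQ_MASK = 0x80    # bit 7: DREQ, set when the FIFO can be serviced
INT_MASK = 0x40     # bit 6: inverted /INT, set when the CF94 is interrupting
BUS_ID_MASK = 0x07  # bits 0-2: HBA bus ID jumpers


def decode_hba_status(status):
    """Split an 8-bit HBA status value into its documented fields."""
    status &= 0xFF
    return {
        "dreq": bool(status & DREQ_MASK),
        "int_pending": bool(status & INT_MASK),
        "bus_id": status & BUS_ID_MASK,
    }


print(decode_hba_status(0xC5))
```

Putting DREQ on bit 7 suits the 65C816 well: with an 8-bit accumulator, a BIT instruction on a memory location copies bit 7 into the N flag and bit 6 into the V flag, so a DREQ poll needs only a BIT followed by a BPL.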

The above logic analyzer capture shows how things progressed during the processing of a “request sense” command. The capture begins after the target has responded to selection, the bus has entered the command phase, the CDB has been sent and the target has deciphered it. As the operation specified by the CDB is acceptable to the target, the latter switched the bus to the data phase. Had the target been unable to decipher the CDB or had the operation specified by the CDB been one not supported by the target, the bus phase would have been switched to status and the target would have reported “check condition” to indicate the error.

The bus phase change IRQ directs the SCSI driver to commence executing its data-in code. Some configuration is done to the CF94 and then it is commanded to start the data transfer sequence, which will result in a bus handshake with the target to tell the latter to send some bytes. DREQ will be asserted by the CF94 as bytes arrive from the target and collect in the FIFO. As long as DREQ remains asserted, the driver will toggle /DACK to fetch a byte from the FIFO. This process will be repeated until DREQ is deasserted.

As earlier noted, “request sense” returns 18 bytes during data-in, which is apparent in this capture; /DACK was toggled 18 times before DREQ was deasserted. I should also add that the ANSI SCSI-2 1994 standard says up to 65,535 blocks may be accessed in one transaction when reading or writing a disk’s medium. My driver is not quite that ambitious, as it can only process 127 blocks maximum per transaction. Either way, a whole bunch of bytes could be transferred during data-in, 65,024 in a 127-block transfer to/from a disk. In such a case, it is possible DREQ could be deasserted and reasserted several times during data-in, depending on how well the target can keep up. If DREQ does deassert during a transfer procedure, the SCSI driver will spin in a loop until the CF94 is ready again. Theoretically, the driver could get “stuck” if the target goes completely out to lunch by failing to send all requested data and not changing the bus phase to status to indicate a problem.
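The DREQ and /DACK handshake during data-in can be modelled in a few lines of Python. This is a toy model under stated assumptions, not a description of the CF94's internals: `FifoModel`, `data_in` and the `burst` parameter (how many bytes the target delivers between polls) are hypothetical, and the loop ends on the transfer count reaching zero, whereas the real driver's loop is broken by the CF94's interrupt.

```python
from collections import deque

FIFO_DEPTH = 16  # the CF94 FIFO is 16 deep


class FifoModel:
    """Toy model of the HBA FIFO with DREQ status and a /DACK counter."""

    def __init__(self, incoming, transfer_count):
        self.pending = deque(incoming)  # bytes the target has yet to send
        self.fifo = deque()
        self.count = transfer_count     # DMA transfer counter
        self.dack_toggles = 0

    def refill(self, burst):
        """Target delivers up to `burst` bytes, limited by FIFO space."""
        while burst > 0 and self.pending and len(self.fifo) < FIFO_DEPTH:
            self.fifo.append(self.pending.popleft())
            burst -= 1

    @property
    def dreq(self):
        # During a read, DREQ is asserted while the FIFO holds at least one datum.
        return len(self.fifo) > 0

    def read_dma_port(self):
        """One access to the DMA port: toggles /DACK, decrements the counter."""
        self.dack_toggles += 1
        self.count -= 1
        return self.fifo.popleft()


def data_in(hba, burst=4):
    buf = bytearray()
    while hba.count > 0:
        hba.refill(burst)
        if not hba.dreq:
            if not hba.pending:
                break      # target stalled; the real driver would keep spinning
            continue       # spin until DREQ is asserted again
        buf.append(hba.read_dma_port())
    return bytes(buf)


sense_data = bytes(range(18))          # stand-in for 18 bytes of sense data
hba = FifoModel(sense_data, transfer_count=18)
received = data_in(hba)
print(len(received), hba.dack_toggles, hba.dreq)
```

The early `break` marks the spot where the "stuck" scenario described above would occur: the counter is non-zero, DREQ is deasserted, and nothing further arrives from the target.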

Eventually, all requested bytes will have been sent and the CF94 will interrupt. When this happens, the driver will determine the new bus phase, which usually will be status. Upon receiving the status byte from the target, the CF94 will again interrupt and the bus phase will now be message-in. After the message has been received, the transaction will terminate and the bus will become free.

The next capture shows successive /DACK pulses in response to being toggled during the “request sense” command’s data-in phase. This capture is with a new SCSI driver that takes advantage of the HBA’s delayed IRQ feature.

Figure: Data-In Rate, Successive /DACK Pulses

Recall that /DACK is toggled to read a byte from the CF94’s FIFO. Hence the interval between successive /DACK toggles can give a fairly accurate picture of the effective data transfer rate. /DACK rises as soon as the MPU has fetched from the FIFO, so /DACK’s rise is the reference point I used. That period was 1.620 µsecs with the earlier HBA design, which is equivalent to an average transfer rate of 603 KB/second. The actual transfer rate is somewhat slower than what I estimated from the cycle count in the read loop executed during data-in, possibly due to a jiffy IRQ hitting at some point during the transfer.

With a new SCSI driver not having the SEI - CLI pair in the data-in loop, the logic analyzer indicated that 1.375 µsecs elapsed between successive /DACK pulses, translating to a transfer rate of 710 KB/second, an 18 percent improvement. The 710 KB/second rate was exactly the same as the rate seen during a one-block disk write (512 bytes); the data-out loop has never had a timing race.
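The quoted rates follow directly from the measured /DACK periods. The Python below reproduces the arithmetic; it assumes that "KB" here means 1024 bytes (the reading that matches the quoted figures) and that Ø2 was running at 16 MHz during the captures, which is V1.3's stated maximum but is not stated for these particular measurements.

```python
def rate_kb_per_s(period_us):
    """Bytes per second for one byte every period_us, expressed in units of 1024 bytes."""
    return 1_000_000 / period_us / 1024


OLD_PERIOD_US = 1.620  # earlier HBA, SEI/CLI in the loop
NEW_PERIOD_US = 1.375  # new HBA, SEI/CLI removed
PHI2_MHZ = 16          # ASSUMED clock rate during the captures

old_rate = rate_kb_per_s(OLD_PERIOD_US)
new_rate = rate_kb_per_s(NEW_PERIOD_US)
improvement_pct = (new_rate / old_rate - 1) * 100

# Loop time per byte expressed in clock periods at the assumed clock rate.
old_cycles = OLD_PERIOD_US * PHI2_MHZ
new_cycles = NEW_PERIOD_US * PHI2_MHZ

print(round(old_rate), round(new_rate), round(improvement_pct))
print(round(old_cycles, 2), round(new_cycles, 2), round(old_cycles - new_cycles, 2))
```

Under the 16 MHz assumption the two periods differ by very nearly four clock cycles per byte, which agrees with the four cycles attributed to the SEI and CLI instructions.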

For now, 710 KB/second is about as fast as POC V1.3 is going to go during SCSI operations. This limit is essentially a function of how rapidly the MPU can do “pretend DMA” during an information transfer phase. Ø2 clock speed is mostly the determining factor, and Ø2 at 16 MHz is as fast as the V1.3 hardware can go. In theory, if V1.3 could be run at 20 MHz, the transfer rate could increase to about 880 KB/second. The next POC design should take care of that.

————————————————————
¹The 74ACT541 bus driver also presents the state of the HBA’s bus ID jumpers on bits 0-2 inclusive.

In POC V1.3, all ROM and I/O device accesses are wait-stated to support 16 MHz operation. As the 541 bus driver is part of the HBA hardware, reading it incurs a wait-state, even though the 541 is fast enough to not require one. Glue logic in a future POC design will discriminate between accessing the 541 and accessing the CF94 so as to avoid wait-stating the former.

²Although floated by virtue of all devices being in the high-Z state, the terminators bias the bus to approximately 3 volts with respect to ground. For historical reasons, all single-ended SCSI bus signals are active-low.
