PostPosted: Sat Jan 20, 2024 8:24 am 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
Although this topic describes a programming situation with my POC V1.3 unit, it isn’t specific to it, as anyone who is using a multichannel UART might find it of interest.

V1.3 has four serial I/O (SIO) channels, requiring an array of eight circular queues (FIFOs) defined in contiguous RAM, four for reception (RxQs) and four for transmission (TxQs).  A zero-based index is used to identify which channel is being processed.  The general processing methodology I’ve devised can support any even-numbered quantity of SIO channels, theoretically up to 128 maximum.  The practical limit appears to be 32 channels, although I suspect the volume of processing required to handle continuous, bi-directional traffic on that many channels would leave little time to do much of anything else.  :D

Anyhow, in order to keep a lid on the cycle count in the SIO primitives, two arrays of pointers are set up on direct (zero) page during system POST, one array for the RxQs and the other for the TxQs.  Each queue has two pointers assigned to it, referred to for discussion purposes as “get” and “put”.  A fetch is executed with LDA (get,X) and a store with STA (put,X), in which get and put are the base addresses of the RxQ and TxQ pointer arrays, respectively.  In both cases, the offset in .X is set to channel index × 2.
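
For illustration, a fetch and a store boil down to something like the sketch below.  The get and put labels are the pointer-array bases just described; the channel index arriving in .Y is an assumption for the example only.

Code:
;fetch a datum from channel .Y’s RxQ (queue-empty test not shown)
;
         tya                   ;channel index
         asl                   ;× 2 = pointer array offset
         tax                   ;offset -> .X
         lda (get,x)           ;datum fetched via this channel’s “get” pointer
;
;store a datum into channel .Y’s TxQ (queue-full test not shown)
;
         sta (put,x)           ;datum stored via this channel’s “put” pointer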

Prior to attempting a fetch, it must be determined if there is at least one valid datum in the target queue, which will be true if put != get.  Similarly, prior to attempting a store, it must be determined if the target queue can accept a datum, which will be true if put+1 != get.  In some cases, it will be useful to know to what extent an RxQ has been filled, e.g., as in managing flow-control.  Assuming a 256-byte queue size, that may be determined with put - get, with borrow discarded.
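
Expressed in code, and assuming hypothetical rxget/rxput and txget/txput direct-page labels for the pointer LSBs (with 256-byte queues the MSBs never move), the three tests amount to something like:

Code:
;datum waiting in channel .X’s RxQ?  Z will be clear if put != get
;
         lda rxput,x           ;RxQ “put” pointer LSB
         cmp rxget,x           ;Z=0 means at least one datum is queued
;
;room in channel .X’s TxQ?  Z will be clear if put+1 != get
;
         lda txput,x           ;TxQ “put” pointer LSB
         inc                   ;put+1
         cmp txget,x           ;Z=0 means the queue can accept a datum
;
;RxQ fill level, e.g., for flow control: put - get, borrow discarded
;
         lda rxput,x
         sec
         sbc rxget,x           ;.A = number of datums in the queue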

My primary concern with this scheme is the large amount of RAM being consumed to support it.  The queue array occupies 2KB (256 × 4 × 2) and the queue pointer array occupies 32 bytes of direct page.  There are also pointers to three hardware registers per channel, along with bit fields on direct page that track things such as which receiver’s RTS has been de-asserted and which transmitter has been disabled because it has nothing to transmit.  All-in-all, 60 direct-page bytes are dedicated to SIO processing.

Something determined from experimentation with POC V1.1 is that with the MPU running at double-digit Ø2 rates, smaller SIO queues are practical.  In V1.3, which runs 28 percent faster than V1.1, the MPU can process the SIO queues nearly 40 times faster than data can come in or go out the serial ports when running at 115.2 Kbps.  With that in mind, I’m looking to modify the SIO primitives to work with a 128-byte queue size, which will free up 1KB.  That 1KB can be used to rearrange the bank $00 map for better utilization.

Of course, there’s always a catch, and in this case, a smaller queue size complicates queue pointer management.  Following each queue fetch or store, the corresponding pointer’s least-significant byte (LSB) is incremented to point to the next location.  As the present queue size is 256 bytes, that scheme requires no special processing, since the pointer’s most-significant byte (MSB) is static and the LSB will wrap when the queue’s upper boundary has been reached.  Testing for queue space is also trivial—it’s the simple put != get or put+1 != get comparisons.  However, such simplicity isn’t possible with a 128-byte queue.
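
To put that in concrete terms, with a full-page queue the entire post-store bump is a single instruction; a minimal sketch, using the put pointer array as above:

Code:
         sta (put,x)           ;queue the outgoing datum
         inc put,x             ;bump the pointer LSB; $FF wraps to $00 on its own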

The queue array starts on a page boundary, i.e., $00xx00.  Hence an even-numbered channel’s queue will end at $xx7F and an odd-numbered channel’s queue will begin at $xx80.  This being the case, incrementing a pointer’s LSB would have to be followed by a range check and, if the result falls outside the queue’s boundaries, a fix-up, i.e., normalization.  Additionally, a test for queue space cannot be a simple comparison, since a comparison is really an unsigned subtraction.  For example, executing put+1 != get to see if an even-numbered channel’s queue can accept a datum will blow up if put is $7F; put+1 would have to be normalized before the comparison could be carried out.

Cutting to the chase, I’ve been idly monkeying with some code to come up with a way to manage the pointers.  Here’s what I’ve got so far.  All it does is increment the pointers, with normalization to keep them within their respective queue boundaries:

Code:
   .opt proc65c02,swapbin ;Kowalski assembler options directive
;===============================================================================
;
;QUEUE POINTER INCREMENTATION SIMULATION
;
;   ———————————————————————————————————————————————————————————————————————
;   This program tests an algorithm for incrementing TIA-232 circular queue
;   pointers using a smaller queue size.  Test conditions are:
;
;   § There are four channels, selected with a zero-based index.
;
;   § Each queue is 128 bytes in size.
;
;   § The base address for the queue array is $BC00.  The algorithm being
;     simulated assumes that the queue array base is page-aligned.
;
;   § An even-numbered channel’s queue starts on a page boundary.  An odd-
;     numbered channel’s queue starts at a page boundary plus the queue
;     size.  Hence, even-numbered channels’ queues are in the range $xx00-
;     $xx7F & odd-numbered channels’ queues are in the range $xx80-$xxFF,
;     where $xx is $BC, $BD, etc.
;
;   In addition to maintaining queue pointers within the stipulated ranges,
;   it is desired to use a minimum of clock cycles & code, although these
;   goals are likely mutually exclusive.
;   ———————————————————————————————————————————————————————————————————————
;
;===============================================================================
;
;TEST DEFINITIONS
;
n_nxpchn =4                    ;number of channels
s_ptr    =2                    ;size of a pointer
s_queue  =128                  ;size of a queue
ptrbase  =$a0                  ;pointer array base address
quebase  =$bc00                ;queue array base address
;
worktmp  =$f0                  ;setup working queue address
;
;===============================================================================
;
   *=$2000
;
;===============================================================================
;
;SET UP TEST ENVIRONMENT
;
setup    lda #>quebase         ;queue array base address MSB
         sta worktmp+1         ;set working address MSB
         lda #<quebase         ;queue array base address LSB
         ldy #0                ;starting channel index
;
;   —-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—
;   Initialize Each Channel’s Queue Pointer
;   —-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—
;
setup010 sta worktmp           ;set working address LSB
         tya                   ;current channel
         asl                   ;pointer array offset
         tax                   ;offset —> .X
         lda worktmp+1         ;set this channel’s queue...
         sta ptrbase+1,x       ;starting boundary MSB
         lda worktmp           ;set this channel’s queue...
         sta ptrbase,x         ;starting boundary LSB
         iny                   ;bump channel index
         cpy #n_nxpchn         ;all channels done?
         beq simstart          ;yes
;
         adc #s_queue          ;no, point to next...
         bcc setup010          ;channel’s queue
;
         inc worktmp+1         ;bump queue pointer MSB
         bra setup010          ;do next channel
;
;===============================================================================
;
;SIMULATED CHANNEL PROCESSING
;
simstart ldy #0                ;starting channel index       
;
loop     tya                   ;channel index
         asl                   ;channel offset
         tax                   ;offset —> .X ...
;
;         —-—-—-—-—-—-—-—-
;         chan 0 —> .X = 0
;         chan 1 —> .X = 2
;         chan 2 —> .X = 4
;         chan 3 —> .X = 6
;         —-—-—-—-—-—-—-—-
;
;   —-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—
;   *** Beginning of Test Algorithm ***
;   —-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—
;
         tya                   ;channel index
         lsr                   ;condition carry...
;
;         —-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-
;         c == 0 indicates even-numbered channel, queue LSB
;                is in the $00-7F range...
;
;         c == 1 indicates odd-numbered channel, queue LSB
;                is in the $80-FF range.
;
;         For reference, a byte within a given queue is acc-
;         essed with (<dp>,X) addressing, which is why only
;         the pointer LSB is modified.
;         —-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-
;
         inc ptrbase,x         ;bump address LSB
         bcc is_even
;
;
;   odd-numbered channel...
;
         bmi next              ;LSB still in range
;
         lda #s_queue          ;reset LSB to the...
         sta ptrbase,x         ;$80-FF range
         bra next
;
;
;   even-numbered channel...
;
is_even  bpl next              ;LSB still in range
;
         stz ptrbase,x         ;reset to $00-$7F range
;
;   —-—-—-—-—-—-—-—-—-—-—-—-—-—-—
;   *** End of Test Algorithm ***
;   —-—-—-—-—-—-—-—-—-—-—-—-—-—-—
;
next     iny                   ;next channel
         cpy #n_nxpchn         ;all channels processed?
         bcc loop              ;no
;
         brk                   ;yes
         nop
         bra setup
;
   .end

I suspect there may be a more-efficient way to do this, but have yet to see it.  Maybe one of you will.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Sat Jan 20, 2024 11:32 am 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I think it's an interesting approach, to branch based on the parity of the channel index.

Another thing to consider is whether you could just store a base offset and buffer "fullness" count, rather than storing head and tail pointers. These would always be within the range 0..127, and wouldn't need any special processing - it might make the actual buffer reads and writes slightly more expensive but I'd guess not by much. All the other operations (checking full, empty, number of bytes available) would be straightforward.
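
For a single channel with a fixed 128-byte buffer, I'd picture something like this sketch; qoffset, qcount and buffer are made-up labels, and the caller is assumed to have already checked that the buffer isn't empty (for a read) or full (for a write):

Code:
;remove the oldest datum from the buffer
;
getbyte  ldy qoffset           ;index of oldest datum
         lda buffer,y          ;fetch it
         iny                   ;advance the base offset...
         tya
         and #$7f              ;...wrapping within 128 bytes
         sta qoffset
         dec qcount            ;one fewer datum in the buffer
         rts
;
;append a datum to the buffer
;
putbyte  pha                   ;save the datum
         lda qoffset           ;oldest datum's index...
         clc
         adc qcount            ;...plus count = insertion point
         and #$7f              ;wrap
         tay
         pla
         sta buffer,y          ;store the datum
         inc qcount            ;one more datum in the buffer
         rts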

Also this doesn't feel too bad to me if it's OK to use the A register, but I'm not sure exactly how it compares to what you already have:
Code:
    lda ptrbase,x
    asl    ; save high bit to carry
    inc    ; add...
    inc    ; ... two
    ror    ; restore high bit from carry


PostPosted: Sat Jan 20, 2024 3:06 pm 

Joined: Fri Dec 21, 2018 1:05 am
Posts: 1117
Location: Albuquerque NM USA
A quad serial device like the OX16C954 has a 128-byte FIFO for each of its transmitters and receivers. How does that change the queue management algorithm? Is a queue even necessary?
Bill


PostPosted: Sat Jan 20, 2024 6:08 pm 

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Here's one approach that I think reproduces the original output:

Code:
setup    ldy #>quebase          ;queue array base MSB
         lda #<quebase          ;queue array base LSB
         ldx #0                 ;pointer array offset
;
setup010 sty ptrbase+1,x        ;set this channel's queue MSB
setup020 sta ptrbase,x          ;set this channel's queue LSB
         cpx #(n_nxpchn-1) * 2  ;all channels done?
         bcs simstart           ;yes
;
         inx                    ;no, next pointer...
         inx                    ;...array slot
         adc #s_queue           ;next channel's queue (carry known clear)
         bcc setup020           ;still in the same page
;
         iny                    ;page crossed, bump MSB
         bra setup010
;
;
simstart ldx #$00               ;pointer array offset (channel × 2)
;
loop     txa                    ;channel × 2
         lsr                    ;back to channel index...
         lsr                    ;...parity -> carry
;
         lda ptrbase,x          ;pointer LSB
         inc                    ;bump it
         bcc is_even            ;even-numbered channel
;
         bmi next               ;odd channel, LSB still in $80-FF
         lda #s_queue           ;wrapped, reset to $80
         bra next
;
is_even  bpl next               ;even channel, LSB still in $00-7F
         lda #$00               ;wrapped, reset to $00
;
next     sta ptrbase,x          ;write the LSB back
         inx
         inx
         cpx #n_nxpchn * 2      ;all channels processed?
         bcc loop               ;no


PostPosted: Sun Jan 21, 2024 6:45 am 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
gfoot wrote:
Another thing to consider is whether you could just store a base offset and buffer "fullness" count, rather than storing head and tail pointers.

I could, but the “fullness” count is susceptible to being in error, since it has to be religiously incremented when a datum is stored and decremented when a datum is fetched.  Prior to either operation, the count has to be checked to avoid an over- or underflow.  So it’s really a different version of the same thing.

Quote:
...it might make the actual buffer reads and writes slightly more expensive but I'd guess not by much.

It would, which is an argument against it.

Quote:
Also this doesn't feel too bad to me if it's OK to use the A register, but I'm not sure exactly how it compares to what you already have:

Code:
    lda ptrbase,x
    asl    ; save high bit to carry
    inc    ; add...
    inc    ; ... two
    ror    ; restore high bit from carry

That will work, and in the parts of the SIO primitive in which queue pointers have to be bumped, the accumulator is unencumbered.  Also, when the 816’s accumulator is set to eight bits, the B accumulator can be used as a temporary store by means of the XBA instruction.

Next time I have the Kowalski assembler/simulator stoked up I’ll test your algorithm for cycle count.  Your code eliminates branches, which is bound to help.  Also, I see a way I can use it to do the “get” and “put” queue-management comparisons.  I’ll also test doing R-M-W ops on the pointer itself just to see what that does to the cycle count.

plasmo wrote:
A quad serial device like the OX16C954 has a 128-byte FIFO for each of its transmitters and receivers.  How does that change the queue management algorithm?

The queue management algorithm would not vary due to on-hardware FIFOs.  However, experimentation and real-world experience have shown me that the receive queue should always be larger than the FIFO, so the entire FIFO content can be transferred to the queue during the processing of a single IRQ.  Otherwise, there is a risk of the queue being completely filled, leaving no room for the datums still in the FIFO.  I’m not sure with the 16C954, but leaving datums in the FIFO will cause most UARTs of that type to keep asserting /IRQ, potentially leading to deadlock—the FIFO has to be emptied to clear the interrupt.  In my UART primitives, I completely empty the receiver FIFO on each IRQ to avoid this problem.
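
In rough outline, the per-IRQ drain loop amounts to the sketch below.  The uart_stat, uart_rhr and rxne symbols are stand-ins for whatever status register, receiver holding register and “receiver not empty” bit the actual part provides, not my firmware’s symbols, and the queue-full/flow-control handling is left out; .X holds channel × 2 as elsewhere.

Code:
;receiver IRQ service fragment: empty the RX FIFO into the RxQ
;
rx_drain lda uart_stat         ;channel status (hypothetical register)
         and #rxne             ;anything left in the RX FIFO?
         beq rx_done           ;no, FIFO is empty & the IRQ source is cleared
;
         lda uart_rhr          ;yes, pull the datum from the FIFO
         sta (rxput,x)         ;queue it via the RxQ “put” pointer
         inc rxput,x           ;bump the pointer (normalization not shown)
         bra rx_drain          ;keep going until the FIFO is dry
;
rx_done  rts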

The alternative if the receive queue is full is to continue reading from the UART until the FIFO is empty to clear the interrupt, and discard the datums that can’t be queued.  Obviously, doing so will corrupt the data stream.

Quote:
Is a queue even necessary?

Yes.  Without queues, the system will tend to bog down during periods of continuous reception and transmission, since any I/O access will have to occur at whatever rate the I/O device can operate.  Queuing helps to decouple the pace at which the UART operates from the pace at which the rest of the system operates.

All the FIFOs do for the system is reduce the frequency of device interrupts, which was a significant factor in the replacement of the old 8250 UART in Intel hardware with the 16550 types.  The 8086 and 80286 had horrible interrupt latency, and often it was impossible to run the serial ports faster than about 9600 bps because the MPU couldn’t service the 8250 fast enough to avoid overrun errors.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Sun Jan 21, 2024 10:29 am 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Another slight optimisation - if you advance the pointers in the opposite direction then you can do ASL : DEC : ROR instead and save another few cycles.
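
With the write-back included, that would look something like this (same ptrbase layout and .X = channel × 2 as in the original post):

Code:
    lda ptrbase,x
    asl    ; park bit 7 in carry, double the low seven bits
    dec    ; step back one in the doubled domain
    ror    ; halve & restore bit 7: low seven bits decremented modulo 128
    sta ptrbase,x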


Last edited by gfoot on Mon Jan 22, 2024 10:33 am, edited 1 time in total.

PostPosted: Sun Jan 21, 2024 5:22 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
gfoot wrote:
I think it's an interesting approach, to branch based on the parity of the channel index.

Another thing to consider is whether you could just store a base offset and buffer "fullness" count, rather than storing head and tail pointers. These would always be within the range 0..127, and wouldn't need any special processing - it might make the actual buffer reads and writes slightly more expensive but I'd guess not by much. All the other operations (checking full, empty, number of bytes available) would be straightforward.


If you have separate head and tail pointers then they are managed by separate processes - the producer updates the head and the consumer updates the tail. There is no locking (semaphores, mutual exclusion, or even just disabling interrupts, etc.) required as there would be if the 2 processes were accessing the same variable. In this scenario the consumer can read bytes out of a queue without disabling interrupts (or otherwise creating a queue lock).

Bumping either pointer is easier if the buffer size is a power of 2 and its start is aligned to it, too, as it then comes down to simple INC, AND and CMP operations.
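
e.g. something like this for a 64-byte buffer, where head and tail are illustrative single-channel index variables and the datum store at the old head is left out:

Code:
;producer side: advance the head index if there is room
;
         lda head              ;producer index, 0..63
         inc                   ;advance...
         and #63               ;...and wrap; buffer size must be a power of 2
         cmp tail              ;caught up with the consumer's index?
         beq no_room           ;yes, the buffer is full
         sta head              ;no, commit the advanced index
no_room  rts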

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Mon Jan 22, 2024 5:56 am 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
gfoot wrote:
Another slight optimisation - if you advance the pointers in the opposite direction then you can do ASL : DEC : ROR instead and save another few cycles.

That would work and would save two clock cycles.  The only downside is that if debugging requires examining a queue’s contents, those contents will be backwards.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Mon Jan 22, 2024 3:50 pm 

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Though it belatedly occurs to me that if the accumulator is set to 16 bits, then possibly:

Code:
setup    ldx #0                 ;pointer array offset
         lda #quebase           ;16-bit queue array base (m = 0)
;
setup010 sta ptrbase,x          ;set this channel's 16-bit pointer
         cpx #(n_nxpchn-1) * 2  ;all channels done?
         bcs simstart           ;yes
;
         adc #s_queue           ;no, next channel's queue (carry known clear)
         inx
         inx
         bra setup010


No messing about with where the direct page is.

Though I'm not sure about the rest.


PostPosted: Tue Jan 23, 2024 9:16 pm 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
teamtempest wrote:
Though it belatedly occurs to me that if the accumulator is set to 16 bits, then possibly...

The get-put loops in the interrupt handler process a byte at a time, so there isn’t any point in the code at which 16-bit processing would be advantageous.  In fact, the addition of a REP and SEP pair would lengthen the per-loop cycle count by four clocks.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Wed Jan 24, 2024 7:24 pm 

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Mmm, I can see that, but I thought the setup was a one-time initialization cost. Are you saying that this setup occurs each time an interrupt hits? On the face of it, that would seem to be wasteful.

I agree the main handler would not benefit much (if at all) from 16-bit processing.


PostPosted: Wed Jan 24, 2024 7:27 pm 

Joined: Sun May 13, 2018 5:49 pm
Posts: 255
If you like your algorithm with 8-bit math (e.g., throwing away any carry/borrow information and wrapping automatically), why not just do the same thing with fewer bits (e.g., mask off the bits you'd like to ignore)?

After doing any operation that modifies an index or calculates the difference, just do an AND #$7F if you want to use 7-bit indexes (128-byte buffers) or AND #$3F for 64-byte buffers, etc.  It does require a power of two for your buffer size, but it retains all the features you liked about 8-bit indices and only adds a couple of cycles to each operation that touches the indices.

After re-reading the original post, I see it's getting the zero-page-based pointers to work properly that is more difficult.  For those to work with this scheme, you just need to take the LSb of your channel # and move it into the MSb of your index before dereferencing to the actual data with (base,x).  That would force even channels to use $00-$7F and odd to use $80-$FF, with each “even/odd pair” of addresses having the same MSB value in zero page.

For 64-byte buffers, you'd move the two LSbs into the two MSbs.

[Abbrev. key: MSb = Most Significant bit, MSB = Most Significant BYTE, LSb = Least Significant Bit]
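
Something like this, maybe, where index is a made-up direct-page array of 7-bit queue indexes, .Y holds the channel number and .X = channel # × 2 as in the original post:

Code:
         tya                   ;channel number
         lsr                   ;channel LSb -> carry
         lda #0
         ror                   ;.A = channel parity in bit 7, rest zero
         ora index,x           ;merge in the 7-bit queue index ($00-$7F)
         sta ptrbase,x         ;pointer LSB lands in the right half of the page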


PostPosted: Wed Jan 24, 2024 7:54 pm 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
teamtempest wrote:
Mmm, I can see that, but I thought the setup was a one-time initialization cost. Are you saying that this setup occurs each time an interrupt hits? On the face of it, that would seem to be wasteful.

I agree the main handler would not benefit much (if at all) from 16-bit processing.

Oh, I think you misunderstood what the topic was about.  The setup wasn’t what I was discussing.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Wed Jan 31, 2024 2:37 am 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
Having cogitated on this for a while, I’ve worked out management details for a 128-byte queue.  Since the proof is in the pudding, I modified the relevant parts of POC V1.3’s firmware and gave it a try.  Much to my utter astonishment, it worked right on the first try.  I really was expecting to see baby barf all over the console screen when I powered up the unit.  I did cheat, so to speak, by testing the pointer-management parts in the Kowalski simulator, so I did have some confidence that that part of it would work.

The attached files are from the assembly listing, illustrating the foreground and background components of the serial I/O (SIO) driver.  Code is for the 65C816, of course, but some minor modifications can adapt the driver to the 65C02.

POC V1.3 has four SIO channels, but due to the way in which the driver is designed, it could be made to service up to 127 channels.  I suspect 16 channels would likely be a practical limit, since if all channels are communicating in both directions at the same time, the interrupt processing load would become quite impressive.  :D


Attachment:
File comment: SIO Driver Foreground
small_queue_foreground.txt [7.65 KiB]
Downloaded 114 times
Attachment:
File comment: SIO Driver Background
small_queue_background.txt [6.82 KiB]
Downloaded 110 times

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Wed Jan 31, 2024 5:01 am 

Joined: Fri Jan 26, 2024 5:47 am
Posts: 40
Location: Prague; Czech Republic; Europe; Earth
BigDumbDinosaur wrote:
Otherwise, there is a risk of the queue being completely filled, leaving no room for the datums still in the FIFO.  I’m not sure with the 16C954, but leaving datums in the FIFO will cause most UARTs of that type to keep asserting /IRQ, potentially leading to deadlock—the FIFO has to be emptied to clear the interrupt.


Maybe it helps (or not): I plan to check whether “the end is near” (queue nearly full) and use flow control (RTS) to block more bytes from coming; then, when “the end is here” (queue full), I disable generation of interrupts, so the last byte stays in the ACIA and I get no indication of it, but it is not discarded.

Then the reading routine can enable interrupts again after reading a byte and (if the queue is reasonably empty) de-assert RTS too.  (And I will get an interrupt and read the last byte.  At least I hope so.)

If the other party is reasonable, it will stop sending me bytes when I am busy (RTS), and only the few already on the way will arrive afterward (hopefully fewer than the rest of the queue can accommodate).  When I make room (and release RTS), the other party will continue emptying its buffers on me.

If the flow control does not work, then I take as many bytes as I can and start listening again as soon as I can accept more.

As setting/clearing RTS is just one store instruction for me, I do not test whether RTS is set or not; I set it anyway, because that is fewer cycles than reading, comparing and branching, and setting it to the same value does not change anything.  It probably takes more instructions, but I see it as more correct in my setup.  For now this is just theory, as my hardware is under development and so not yet working.
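
Just to show the idea, a rough sketch of the “end is near” check for one channel; the 3/4-full threshold, the rxput/rxget fill-level variables and the uart_rts store are all assumptions of mine, not working code:

Code:
;after queuing a received byte: if the queue is nearly full, drop RTS
;
         lda rxput             ;fill level = put - get (256-byte queue assumed)
         sec
         sbc rxget
         cmp #192              ;past the high-water mark (3/4 full)?
         bcc rts_ok            ;no, leave RTS alone
;
         lda #0
         sta uart_rts          ;yes, de-assert RTS (hypothetical control register)
rts_ok   rts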

_________________
http://micro-corner.gilhad.cz/, http://8bit.gilhad.cz/6809/Expanduino/Expanduino_I.html, http://comp24.gilhad.cz/Comp24-specification.html, and many others

