PostPosted: Thu Jan 30, 2014 2:40 am 
Joined: Tue Mar 05, 2013 4:31 am
Posts: 1378
Well, since I've rewritten the basic startup procedure and replaced all of the interrupt code plus character in/out routines (in my latest SyMon update), I started looking at execution times for sending and receiving a character and the overall percentage of CPU time it takes to manage an Async controller. For reference, I'll put the code execution time here and then show the code used to service the 6551 chip.

There are a handful of routines which initialize the soft vectors and the I/O devices and all interrupt service routines use indirect jumps to page $03 to get in and out of the ROM routines. This costs some time (10 clock cycles) but allows flexibility to intercept all of the interrupt processes (NMI/BRK/IRQ). The CHIN and CHOUT routines manage the 128-byte circular buffers for receive and transmit. In the case of CHOUT, it also checks a flag to see if the 6551 is enabled for XMIT IRQ and if not, turns it on. The IRQ service routine manages data in and out of the chip and the buffers inversely from the CHIN/CHOUT routines. It also handles the BRK instruction and a null character as a software panic break. The transmit routine also turns off the XMIT IRQ once the buffer empties out and updates the flag pointer. All pointers are in page $00 and the dual 128-byte buffers are page $02.

I've done the calculations for service times in clock cycles:

IRQ Service Routine:
7 clock cycles IRQ response latency (from current executing instruction to jump to the IRQ vector)
48 clock cycles for ROM IRQ pre-processing and post-processing including both indirect JMPs to page $03

Shortest time thru IRQ vector (not 6551 as cause) = 7 clock cycles (62 total for non 6551 IRQ)

RCV routine branch = 15 clock cycles
XMT routine branch = 23 clock cycles
CTS Error (fall thru IRQ decision tree) = 31 clock cycles (86 total for CTS error)

RCVCHR routine: LO/HI = buffer wraparound = No/Yes
Clock cycles = 41/42 for no XMT bit on (111/112 total for no xmit)
Clock cycles = 40/41 drop to XMTCHR routine (110/111 total for drop to xmit)

XMTCHR routine: LO/HI = buffer wraparound = No/Yes
Clock cycles = 32/33 if xmit stays on (110/111 total if XMIT left on)
Clock cycles = 44/45 if xmit to be turned off (122/123 total if XMIT gets turned off)
Clock cycles = 19 if no data left, turn off xmit and exit (97 total)

Character routines:
CHIN - 47/48 clock cycles to get character from buffer
- input loop is 6 clock cycles per then add 47/48 per above

CHOUT - 54/55 clock cycles if XMIT already on, 67/68 if XMIT needs to be turned on
- output loop is 6 clock cycles per then add 54/55 or 67/68 as per above

Pretty sure I got them right on clock cycles. If you use standard 19.2K bps on the 6551 chip, that's about 1920 interrupts per second when sending data continuously. That results in an IRQ generated by the 6551 every 521 microseconds. Looking at the number of clock cycles to send a character (assuming ideal conditions, once the XMIT is turned on), it takes 110 clock cycles per character and you add 1 cycle every time the buffer wraps around. That's 211215 clock cycles to send 1920 characters of data in 1 second of time. If sustained at this rate, that would take just over 21% of the CPU bandwidth for a 1MHz 65C02. Add in the CHOUT routine time to put the characters in the buffer and you need to add another 54 clock cycles per character and add 1 cycle for buffer wraparound. That's another 103695 clock cycles to put 1920 characters into the buffer. And another 10% of the CPU bandwidth. In total, 314910 clock cycles per second for sustained transmit (under ideal conditions), or about 31.5% of the CPU bandwidth. Add in sustained receive and things get worse, albeit the code can do a RCV/XMIT in one IRQ service under ideal conditions.
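The arithmetic above can be checked with a few lines of Python. This is only a sketch, not part of SyMon: the function names are made up for illustration, and it assumes 10 bits per character (start, 8 data, stop) and one extra cycle per wrap of the 128-byte buffer.

```python
def cycles_per_second(baud, cycles_per_char, buffer_size=128, bits_per_char=10):
    """Clock cycles spent per second moving characters at a sustained rate.

    Assumption: one extra cycle each time the circular buffer index wraps.
    """
    chars = baud // bits_per_char      # 19200 bps -> 1920 characters/s
    wraps = chars // buffer_size       # 1920 / 128 -> 15 wraparounds/s
    return chars * cycles_per_char + wraps


def cpu_load_percent(cycles, clock_hz):
    """Share of the CPU clock consumed, as a percentage."""
    return 100.0 * cycles / clock_hz


irq_cycles = cycles_per_second(19200, 110)    # IRQ transmit path, XMIT left on
chout_cycles = cycles_per_second(19200, 54)   # CHOUT, XMIT already on
total = irq_cycles + chout_cycles

load_1mhz = round(cpu_load_percent(total, 1_000_000), 1)
load_8mhz = round(cpu_load_percent(total, 8_000_000), 1)
print(irq_cycles, chout_cycles, total, load_1mhz, load_8mhz)
```

Changing the clock or the per-character cycle count shows how quickly the load drops at higher clock rates.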

Two things come out of this. First, going to higher speeds with a 6551-style chip (i.e., no on-board FIFOs) becomes a performance inhibitor. Second, it makes sense to run at much faster clock rates when possible. The same code doing the same function running on an 8MHz 65C02 would only require about 4% of the CPU bandwidth. For reference, here are the pertinent routines:

Code:
IRQ_VECTOR   ;This is the ROM start for the BRK/IRQ handler
             PHA   ;Save A Reg (3)
             PHX   ;Save X Reg (3)
             PHY   ;Save Y Reg (3)
             TSX   ;Get Stack pointer (2)
             LDA   $0100+4,X   ;Get Status Register (4)
             AND   #$10   ;Mask for BRK bit set (2)
             BEQ   DO_IRQ   ;If not set, handle IRQ (2/3)
             JMP   (BRKVEC)   ;Jump to Soft vectored BRK Handler (5) (24 clock cycles to vector routine)
DO_IRQ    JMP   (IRQVEC)   ;Jump to Soft vectored IRQ Handler (5) (25 clock cycles to vector routine)
;
IRQ_EXIT      ;This is the standard return for the IRQ/BRK handler routines
             PLY   ;Restore Y Reg (4)
             PLX   ;Restore X Reg (4)
             PLA   ;Restore A Reg (4)
             RTI   ;Return from IRQ/BRK routine (6) (18 clock cycles from vector jump to IRQ end)
;


Code:
BRKINSTR   PLY   ;Restore Y reg
               PLX   ;Restore X Reg
               PLA   ;Restore A Reg
               STA   ACCUM   ;Save A Reg
               STX   XREG   ;Save X Reg
               STY   YREG   ;Save Y Reg
               PLA   ;Get Processor Status           
               STA   PREG   ;Save in PROCESSOR STATUS preset/result
               TSX
               STX   SREG   ;Save STACK POINTER
               PLA   ;Pull RETURN address from STACK then save it in INDEX
               STA   INDEX   ;Low byte
               PLA
               STA   INDEXH   ;High byte
               JSR   CR2   ;Send 2 CR,LF to terminal
               JSR   PRSTAT   ;Display contents of all preset/result memory locations
               JSR   CROUT   ;Send CR,LF to terminal
               JSR   DISLINE   ;Disassemble then display instruction at address pointed to by INDEX
               LDA   #$00   ;Clear all PROCESSOR STATUS REGISTER bits
               PHA
               PLP
BREAKEY2   LDA   #$7F   ;Set STACK POINTER preset/result to $7F
               STA   SREG
               STZ   ITAIL   ;Zero out input buffer and reset pointers
               STZ   IHEAD
               STZ   ICNT
BR_NMON      BRA   NMON   ;Done interrupt service process, re-enter monitor
;
BREAKEY      PLY   ;Pull Y Reg (4)
               PLX   ;Pull X Reg (4)
               PLA   ;Pull A Reg (4)
               CLI   ;Enable IRQ (2)
               BRA   BREAKEY2   ;Finish Break Key processing (2/3)
;
;new full duplex IRQ handler
;
INTERUPT   BIT   SIOSTAT   ;xfer irq bit to n flag (4)
               BPL   REGEXT   ;if set, 6551 caused irq,(do not branch) (2/3) (7 clock cycles to regexit if not)
;
ASYNC         LDA SIOSTAT   ;get 6551 status reg (4)
               AND #%00001000   ;check receive bit (2)
               BNE RCVCHR   ;get received character (2/3) (15 clock cycles to jump to RCV)
;
               LDA SIOSTAT   ;get 6551 status reg (4)
               AND #%00010000   ;check xmit bit (2)
               BNE XMTCHR   ;send xmit character (2/3) (23 clock cycles to jump to XMIT)
;
;no bits on means cts went high
               LDA #%00010000 ;cts high mask (2)
;
IRQEXT      STA STTVAL ;update status value (3) (31 clock cycles to here of CTS fallout)
;
REGEXT      JMP   (IRQRTVEC) ;handle old irq (5)
;
BUFFUL      LDA #%00001100 ;buffer overflow (2)
               BRA IRQEXT ;branch to exit (2/3)
;
RCVCHR      LDA SIODAT   ;get character from 6551 (4)
               BEQ   BREAKEY   ;If Break character, branch to Break Key process    (2/3)
;
RCV0         LDY ICNT   ;get buffer counter (3)
               BMI   BUFFUL   ;check against limit, branch if full (2/3)
;
               LDY ITAIL ;room in buffer (3)
               STA IBUF,Y ;store into buffer (5)
               INY ;increment tail pointer (2)
               BPL   RCV1   ;check for wraparound ($80), branch if not (2/3)
               LDY #$00 ;else, reset pointer (2)
RCV1         STY ITAIL ;update buffer tail pointer (3)
               INC ICNT ;increment character count (5)
;   
               LDA SIOSTAT ;get 6551 status reg (4)
               AND #%00010000 ;check for xmit (2)
               BEQ REGEXT   ;exit (2/3) (41 if exit, else 40 and drop to XMT)
;
XMTCHR      LDA OCNT ;any characters to xmit? (3)
               BEQ NODATA ;no, turn off xmit (2/3)
;
OUTDAT      LDY OHEAD ;get pointer to buffer (3)
               LDA OBUF,Y ;get the next character (4)
               STA SIODAT ;send the data (4)
               INY ;increment index (2)
               BPL   OUTD1   ;check for wraparound ($80), branch if not (2/3)
               LDY #$00 ;else, reset pointer (2)
;
OUTD1         STY OHEAD ;save new head index (3)
               DEC OCNT ;decrement counter (5)
               BNE   REGEXT   ;If not zero, exit and continue normal stuff (2/3) (32 if branch, 31 if continue)
;
NODATA      LDA   #$09   ;get mask for xmit off / rcv on (2)
               STA SIOCOM ;turn off xmit irq bits (5)
               STZ OIE ; zero pointer (3)
               BRA REGEXT ;exit (3) (13 clock cycles added for turning off xmt)
;


Code:
;CHOUT subroutine: takes the character in the ACCUMULATOR and places it in the xmit buffer
;this routine also preserves the character sent in the A reg on exit (standard one did also)
;   new routine to work with the new IRQ service routine for the 6551
; now transmit is IRQ driven and buffered
;   the output buffer is fixed at 128 bytes, so buffer management is added
;
CHOUT         PHY ;save Y reg
OUTCH         LDY OCNT ;get character output count in buffer
               BMI   OUTCH   ;check against limit, loop back if full
;
               PHP   ;Save IRQ state
               SEI ;Disable irq
               LDY OTAIL ;Get index to next spot
               STA OBUF,Y ;and place in buffer
               INY ;Increment index
               BPL   OUTC1   ;Check for wrap-around ($80), branch if not
               LDY #$00 ;Yes, zero pointer
;
OUTC1         STY OTAIL ;Update pointer
               INC OCNT ;Increment character count
               BIT OIE ;Is xmit on?
               BMI OUTC2 ;Yes, operation done
;
               LDY   #$05   ;Get mask for xmit on
               STY SIOCOM ;Turn on xmit irq
               DEC OIE ;Update flag
;
OUTC2         PLP   ;Restore IRQ flag
               PLY   ;Restore Y reg
               RTS ;Return
;
;CHIN subroutine: Wait for a keystroke from input buffer, return with keystroke in A Reg
;   new routine to work with the new IRQ service routine for the 6551
;   the input buffer is fixed at 128 bytes, so buffer management is replaced
;
CHIN         LDA   ICNT   ;Get character count
               BEQ   CHIN   ;If zero (no character, loop back)
;
               PHY   ;Save Y reg
               PHP   ;Save CPU flag set
               SEI   ;Disable IRQ to work with buffer pointers
               LDY   IHEAD   ;Get the buffer head pointer
               LDA   IBUF,Y   ;Get the character from the buffer
               INY   ;Increment the buffer index
               BPL   CHIN1   ;Check for wraparound ($80), branch if not
               LDY   #$00   ;Reset the buffer pointer
CHIN1         STY   IHEAD   ;Update buffer pointer
               DEC   ICNT   ;Decrement the buffer count
               PLP   ;Restore previous CPU flags (IRQ)
               PLY   ;Restore Y Reg
               RTS   ;Return to caller with character in A reg
;


Overall the code has been very solid. Outside of removing the indirect jumps for soft vectors, I don't see much of a way to streamline the code any further, sans increasing the buffers to 256 bytes each, which would streamline the buffer management a bit. My next board will hopefully be running at 8MHz and using a console chip running at 38.4k, but not the 6551 chip. Comments welcome.
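For anyone following the buffer logic without reading the assembly, here is a rough Python model of the head/tail/count scheme used by CHIN, CHOUT and the IRQ routines. It is a sketch only: the class and method names are hypothetical, and it ignores the SEI/PLP protection the real routines use around pointer updates.

```python
class RingBuffer:
    """Model of a head/tail/count circular buffer such as IBUF or OBUF."""

    def __init__(self, size=128):
        self.size = size
        self.data = [0] * size
        self.head = 0    # next slot to read  (IHEAD / OHEAD)
        self.tail = 0    # next slot to write (ITAIL / OTAIL)
        self.count = 0   # bytes currently held (ICNT / OCNT)

    def put(self, byte):
        """Store a byte; return False if the buffer is already full."""
        if self.count >= self.size:
            return False
        self.data[self.tail] = byte
        self.tail = (self.tail + 1) % self.size   # the INY / BPL / LDY #$00 step
        self.count += 1
        return True

    def get(self):
        """Remove and return the oldest byte, or None if the buffer is empty."""
        if self.count == 0:
            return None
        byte = self.data[self.head]
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return byte


buf = RingBuffer(128)
for value in range(128):
    buf.put(value)
overflow_accepted = buf.put(0xFF)   # buffer is full at this point
first_out = buf.get()               # oldest byte comes out first
buf.put(0x41)                       # lands in slot 0 after the wrap
```

With a 256-byte buffer an 8-bit index register wraps on its own, which is where the small saving would come from, although a one-byte count can then no longer hold the value for a completely full buffer.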

_________________
Regards, KM
https://github.com/floobydust


PostPosted: Thu Jan 30, 2014 3:22 am 
Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1938
Location: Sacramento, CA, USA
floobydust wrote:
... That's 211215 clock cycles to send 1920 characters of data in 1 second of time. If sustained at this rate, that would take just over 21% of the CPU bandwidth for a 1MHz 65C02. Add in the CHOUT routine time to put the characters in the buffer and you need to add another 54 clock cycles per character and add 1 cycle for buffer wraparound. That's another 103695 clock cycles to put 1920 characters into the buffer. And another 10% of the CPU bandwidth. In total, 314910 clock cycles per second for sustained transmit (under ideal conditions), or about 31.5% of the CPU bandwidth. Add in sustained receive and things get worse, albeit the code can do a RCV/XMIT in one IRQ service under ideal conditions.

Two things come out of this, going to higher speed with a 6551 style chip (i.e., no on-board FIFOs) becomes a performance inhibitor. Second, it makes sense to run at much faster clock rates when possible. The same code doing the same function running on an 8MHz 65C02 would only require about 4% of the CPU bandwidth ...

Nice work! Your code looks pretty tight at first glance. I don't see 31.5% being much of a problem in a single-user system. How often would you be full-throttling your 6551 while trying to (simultaneously) do something else in need of similar response performance?

Mike


PostPosted: Thu Jan 30, 2014 3:59 am 
Joined: Tue Mar 05, 2013 4:31 am
Posts: 1378
Thanks... I tried to ensure no extra instructions and no internal subroutines. The nice thing is you can pre or post the ROM routines or replace them easily. You can also change the HW parameters as they are soft as well and use the ROM init routines to set it up.

True on the 31.5%... may not be an issue in most cases. However, the next step is to add 6522 timer support, add a high-resolution clock for timer services and RTC. I'm planning on a 4 millisecond interval for the timer (can easily scale to 16MHz clock rate). Other longer term ideas are data collection, using timer services to sample I/O (digital and analog), timestamp it, buffer it, and send it via async. So I'm looking to keep code size both small and fast and use higher clock rates to keep the CPU more available.

I still need to put together a small SCC2691 UART board so I can add that in and start writing code for it as a new console replacement. So much to do for a hobby and so little free time.

_________________
Regards, KM
https://github.com/floobydust


PostPosted: Thu Jan 30, 2014 6:09 am 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8460
Location: Southern California
floobydust wrote:
the next step is to add 6522 timer support, add a high-resolution clock for timer services and RTC. I'm planning on a 4 millisecond interval for the timer (can easily scale to 16MHz clock rate). Other longer term ideas are data collection, using timer services to sample I/O (digital and analog), timestamp it, buffer it, and send it via async. So I'm looking to keep code size both small and fast and use higher clock rates to keep the CPU more available.

The code I'm using for that (RTC on a 6522 timer, T1) is just past the middle of the page at http://6502.org/tutorials/interrupts.html. I put the RTC on NMI, eliminating the polling overhead for the RTC. The RTC actually keeps two sets of RAM registers: a set for a 32-bit count of centiseconds (hundredths of seconds), and a set with a byte for hundredths, which rolls over at 100 and carries into the byte for seconds, which rolls over at 60 and carries into the byte for minutes, and so on. Of course in 99% of the cases, it exits before even getting to the seconds. For many things, the former set of variable bytes is easier to use, but the second one is easier to keep time of day and calendar on, not requiring conversion every time you want TOD and date.
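The two register sets described above can be modelled roughly as follows. This is a Python sketch, not the 6502 code from the tutorial page: the class name is hypothetical, a 10 ms tick is assumed, and the hours rollover at 24 is my reading of "and so on" (the calendar bytes are left out).

```python
class SoftRTC:
    """Model of an RTC kept in RAM and bumped by a periodic timer interrupt."""

    def __init__(self):
        self.centiseconds = 0   # free-running 32-bit count
        self.hundredths = 0     # time-of-day set starts here
        self.seconds = 0
        self.minutes = 0
        self.hours = 0

    def tick(self):
        """Call once per 10 ms timer interrupt."""
        self.centiseconds = (self.centiseconds + 1) & 0xFFFFFFFF
        self.hundredths += 1
        if self.hundredths < 100:
            return              # the common case: exit before touching seconds
        self.hundredths = 0
        self.seconds += 1
        if self.seconds < 60:
            return
        self.seconds = 0
        self.minutes += 1
        if self.minutes < 60:
            return
        self.minutes = 0
        self.hours = (self.hours + 1) % 24   # assumed rollover


rtc = SoftRTC()
for _ in range(6000):           # 6000 ticks of 10 ms = one minute
    rtc.tick()
```

The early returns are the point: 99 ticks out of 100 never get past the hundredths byte.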

I also do alarms with the routine, such that the alarm-installation program puts the alarm due soonest at the head of the line for the RTC routine to watch for a match, and execute the program at the address associated with the alarm time when it comes due, so you can schedule jobs with 10ms resolution, making for one of several components I have for effectively doing multitasking without a multitasking OS. Many alarms can be pending at once, but the RTC routine only watches for the first one in chronological order. Each alarm's routine can of course set another alarm for it to run again if desired.
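A rough Python model of that alarm scheme is below. The names are hypothetical and this is not the actual RTC code: alarms are kept sorted so only the head needs checking on each tick, and the sketch fires on "due or overdue" rather than an exact match, which is an assumption on my part.

```python
import bisect


class AlarmQueue:
    """Pending alarms kept in chronological order, soonest first."""

    def __init__(self):
        self._alarms = []   # sorted list of (due_time, seq, callback)
        self._seq = 0       # tie-breaker so callbacks are never compared

    def install(self, due_time, callback):
        """Insert an alarm at its chronological position."""
        bisect.insort(self._alarms, (due_time, self._seq, callback))
        self._seq += 1

    def tick(self, now):
        """Run every alarm that has come due; only the head is examined."""
        while self._alarms and self._alarms[0][0] <= now:
            _, _, callback = self._alarms.pop(0)
            callback(now)


fired = []
queue = AlarmQueue()
queue.install(30, lambda now: fired.append(("c", now)))
queue.install(10, lambda now: fired.append(("a", now)))
queue.install(20, lambda now: fired.append(("b", now)))

for now in range(40):       # one call per 10 ms tick
    queue.tick(now)
```

A callback can call `install` again from inside itself to reschedule, which matches the "set another alarm for it to run again" idea.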

The RTC is useful for timing lots of things simultaneously, as different routines being run concurrently each put target times in their own variables and watch for them, and exit if they see it's not time to do anything yet, which is another component in effectively doing multitasking without a multitasking OS. (See my article on simple ways to do multitasking without a multitasking OS, for small systems that may not have the resources to run such an OS, or where hard, realtime requirements may rule one out, at http://wilsonminesco.com/multitask/ .) For example, one routine does key scanning, debouncing, and auto key repeat, while at the same time another routine is timing an LED flash rate that gives an indication of low battery, flashing faster and faster as the battery gets lower, and other routines are timing other things, and none of them are affected by the others.
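The "each routine watches its own target time" idea can be sketched in Python like this. Everything here is hypothetical (task names, periods, the tick-based main loop); it only illustrates the pattern of returning immediately when it is not yet time to act.

```python
def make_periodic_task(period, action):
    """Return a task that runs `action` once every `period` ticks."""
    state = {"next_due": 0}          # the task's own target-time variable

    def task(now):
        if now < state["next_due"]:
            return False             # not time yet: exit straight away
        state["next_due"] = now + period
        action(now)
        return True

    return task


def run(tasks, ticks):
    """Main loop: give every task a turn on every tick."""
    for now in range(ticks):
        for task in tasks:
            task(now)


key_scans = []
led_toggles = []
tasks = [
    make_periodic_task(2, key_scans.append),     # e.g. scan keys every 2 ticks
    make_periodic_task(50, led_toggles.append),  # e.g. toggle LED every 50 ticks
]
run(tasks, 100)
```

Neither task knows about the other, and neither one blocks, so their timings stay independent.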

When I want to do audio sampling on regular intervals like 24,000 evenly spaced interrupts per second for example, I normally use a separate VIA's T1 for that, and usually turn off the RTC interrupts. For that, it might have been better to put the RTC interrupts on IRQ and put the sampling interrupt on NMI so I don't have to temporarily turn off the RTC to prevent jitter.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Thu Jan 30, 2014 6:13 am 
Joined: Thu May 28, 2009 9:46 pm
Posts: 8236
Location: Midwestern USA
floobydust wrote:
Two things come out of this, going to higher speed with a 6551 style chip (i.e., no on-board FIFOs) becomes a performance inhibitor. Second, it makes sense to run at much faster clock rates when possible. The same code doing the same function running on an 8MHz 65C02 would only require about 4% of the CPU bandwidth.

And at 20 MHz the MPU utilization would be around 1.6 percent. However, with, say, the NXP 26C92 DUART, which has eight byte FIFOs for RxD and TxD, the load decreases to around 0.2 percent per channel during worst-case CBAT, and simultaneous operation on both channels may reduce the load even more.

Ironically, although the 65C816 has a slightly longer interrupt latency than the 'C02, it can plow through this sort of code even faster in native mode. For one thing, it has a dedicated IRQ vector, avoiding the time-consuming stack sniff needed with the 'C02 to figure out if the interrupt was an IRQ or a BRK. Also, some register instructions that are absent in the 'C02, e.g., XBA, can be used to reduce memory accesses at certain points.

Quote:
Code:
IRQ_VECTOR PHA                 ;Save A Reg (3)
         PHX                   ;Save X Reg (3)
         PHY                   ;Save Y Reg (3)
         TSX                   ;Get Stack pointer (2)
         LDA $0100+4,X         ;Get Status Register (4)
         AND #$10              ;Mask for BRK bit set (2)
         BEQ DO_IRQ            ;If not set, handle IRQ (2/3)
;
         JMP (BRKVEC)          ;Jump to Soft vectored BRK Handler (5) (24 clock cycles to vector routine)
;
DO_IRQ   JMP (IRQVEC)          ;Jump to Soft vectored IRQ Handler (5) (25 clock cycles to vector routine)

I'd rearrange the above like so:

Code:
IRQ_VECTOR PHA                 ;Save A Reg (3)
         PHX                   ;Save X Reg (3)
         PHY                   ;Save Y Reg (3)
         TSX                   ;Get Stack pointer (2)
         LDA $0100+4,X         ;Get Status Register (4)
         AND #$10              ;Mask for BRK bit set (2)
         BNE IS_BRK            ;If set, handle BRK (2/3), otherwise fall thru
;
         JMP (IRQVEC)          ;process IRQ
;
IS_BRK   JMP (BRKVEC)          ;process BRK

You always want to write your code so that the branch is not taken in the most common or most critical case, which in this case, means you don't want to branch on each IRQ. That saves you one clock cycle per IRQ.

Quote:
Code:
ASYNC    LDA SIOSTAT           ;get 6551 status reg (4)
         AND #%00001000        ;check receive bit (2)
         BNE RCVCHR            ;get received character (2/3) (15 clock cycles to jump to RCV)
;
         LDA SIOSTAT           ;get 6551 status reg (4)
         AND #%00010000        ;check xmit bit (2)
         BNE XMTCHR            ;send xmit character (2/3) (23 clock cycles to jump to XMIT)
...

You indicated that you are using a 65C02. If so, you can reduce the above to:

Code:
ASYNC    LDA SIOSTAT           ;get 6551 status reg (4)
         BIT #%00001000        ;check receive bit (2)
         BNE RCVCHR            ;get received character (2/3) (15 clock cycles to jump to RCV)
;
         BIT #%00010000        ;check xmit bit (2)
         BNE XMTCHR            ;send xmit character (2/3) (19 clock cycles to jump to XMIT)
...

and avoid the extra read of the status register. There is a very slight chance that the transmitter could underrun between the first BIT test and the second one, but no harm would come of it. Worst case would be that an occasional extra IRQ would occur due to the TxD underrun. However, the average throughput would be higher, since four fewer clock cycles would be consumed.

Quote:
Overall the code has been very solid. Outside of removing the indirect jumps for soft vectors, I don't see much of a way to streamline the code any further, sans increasing the buffers to 256 bytes each, which would streamline the buffer management a bit. My next board will hopefully be running at 8MHz and using a console chip running at 38.4k, but not the 6551 chip. Comments welcome.

I doubt that increasing the buffer sizes to 256 bytes will have significant effect. While an even page size per buffer reduces index acrobatics, the required instructions really don't eat all that much out of your total MPU bandwidth.

Your big bottleneck is the need to process an interrupt for each byte received or sent. So you suffer the penalty of the seven clock cycles consumed by the interrupt response itself plus the clock cycles consumed in RTIing on every byte moved. This is why UARTs with FIFOs were developed: reduction in IRQ frequency. Efficient code can move as many bytes as the FIFOs can hold in a single interrupt. The ISR code itself becomes a little more convoluted but a performance gain is realized due to the reduced interrupt barrage.

When I switched POC V1.1 from a 2692, which doesn't have a TxD FIFO, to the 26C92, which does, I saw a major decrease in the amount of IRQ processing while driving the console, especially during a full screen paint. The 65C816 was, on average, only being interrupted once per eight bytes transmitted. During S-record transfers from my UNIX box to POC, I saw the same effect, only this time on the receive side. The reason in both cases was that after the ISR has been entered, the code loops back and rechecks the FIFO status for another byte (receive) or an open slot (transmit). Only when the FIFO has been fully emptied (receive) or completely filled (transmit) does the ISR continue onward.

Consider that I have both serial ports set to run at 115.2Kbps. If I run CBAT on both channels, that is a theoretical throughput of 46,080 bytes per second, or a byte being processed every 21.7 microseconds. If I were running a UART without FIFOs, I would be processing as many as 46,080 IRQs per second, which at 10 MHz (POC V1.1's clock speed) is certainly manageable, but does consume significant MPU bandwidth: 368,640 clock cycles each second just in MPU response time, never mind the time consumed in preserving and later restoring the registers. However, by fully utilizing the eight-byte deep FIFOs, I can ideally reduce that interrupt rate to 2880 per second.
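The figures in the paragraph above can be reproduced with a short Python sketch. The function names are hypothetical. Two things are inferred rather than stated in the post: the 8 cycles of interrupt response per IRQ (368,640 / 46,080), and that the 2880 figure corresponds to 16 bytes moved per interrupt, for instance one receive FIFO and one transmit FIFO serviced in the same pass.

```python
def total_bytes_per_second(baud, channels=2, directions=2, bits_per_char=10):
    """Aggregate byte rate with every channel sending and receiving flat out."""
    return (baud // bits_per_char) * channels * directions


def interrupts_per_second(total_bytes, bytes_per_irq):
    """Interrupt rate if each IRQ moves `bytes_per_irq` bytes."""
    return total_bytes / bytes_per_irq


total = total_bytes_per_second(115200)
byte_period_us = 1_000_000 / total           # average time between bytes
response_cycles = total * 8                  # assumed 8 cycles of response per IRQ

no_fifo_rate = interrupts_per_second(total, 1)
fifo8_rate = interrupts_per_second(total, 8)     # one full 8-byte FIFO per IRQ
fifo16_rate = interrupts_per_second(total, 16)   # assumed: Rx + Tx FIFO per IRQ

print(total, round(byte_period_us, 1), response_cycles)
print(no_fifo_rate, fifo8_rate, fifo16_rate)
```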

This is why I have advocated for the use of these UARTs instead of non-FIFO types, such as the 6551, 6850, 8250, etc. There are other reasons as well (e.g., independently adjustable baud rates), but the potential to significantly reduce the interrupt load is the biggie.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Thu Jan 30, 2014 6:23 am 
Joined: Thu May 28, 2009 9:46 pm
Posts: 8236
Location: Midwestern USA
floobydust wrote:
I still need to put together a small SCC2691 UART board so I can add that in and start writing code for it as a new console replacement.

I may have said this before but I highly recommend that you use the 26C92, not the 2691 or 2692. The 2691 and 2692 have no transmit FIFOs and only four-deep receive FIFOs. The 26C92 has eight-deep FIFOs for both RxD and TxD, which if correctly utilized, can dramatically reduce the interrupt barrage when CBAT is in progress. You can just disable the second channel if you don't need it.

The other consideration is that the 2691's timing specs will probably limit your system to a 4-5 MHz Ø2 rate. I've successfully run the 2692 at 12.5 MHz and the 26C92 at 15 MHz.

Quote:
So much to do for a hobby and so little free time.

Tell us about it. :lol:

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Last edited by BigDumbDinosaur on Fri Jan 31, 2014 6:41 am, edited 1 time in total.

PostPosted: Thu Jan 30, 2014 6:29 am 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8460
Location: Southern California
BigDumbDinosaur wrote:
Ironically, although the 65C816 has a slightly longer interrupt latency than the 'C02, it can plow through this sort of code even faster in native mode. For one thing, it has a dedicated IRQ vector, avoiding the time-consuming stack sniff needed with the 'C02 to figure out if the interrupt was an IRQ or a BRK.

However the '02, since it probably won't be running a preemptive multitasking OS, probably won't have much use for BRK. I haven't used BRK since I was in school in '82, so I never have the ISR check for it. :D It's usually unnecessary overhead.

Quote:
I doubt that increasing the buffer sizes to 256 bytes will have significant effect.

Probably not, but it would depend on how big the records you're receiving are and when they're needed and how long it takes to process them compared to how long it takes them to come in. If a record is more than 128 bytes and suddenly you're ready for it and it's not all there because you stopped reception because the buffer was full, now there will be an unnecessary delay while the rest comes in. Again though, that might be a rare situation.

Are there multiple manufacturers of the 26C92?

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Thu Jan 30, 2014 11:45 am 
Joined: Tue Mar 05, 2013 4:31 am
Posts: 1378
BDD, Garth, Thanks for the insight and input.

Garth, have not looked at your IRQ RTC code.... likely will, but want to get some code written for the 6522 and will go back and dig out some old source code to see what I did back in the 80's. BRK instruction, probably not as useful, but I recall my old VICMON cartridge allowed setting of breakpoints which will trigger the BRK instruction for debugging. Having the handler for BRK, you can zero out a byte of code and end up here, but agreed, very limited use.

BDD, Good catch on the IRQ/BRK determination code... every saved clock cycle helps. Was thinking about that one, but didn't get around to it, easy change however. Eliminating the second read of the status register is a good suggestion and using the BIT instruction instead as it doesn't alter the A reg contents, plus the 3 bytes code savings and the clock cycles. So I made the above changes and have one of the boards running some tests now to hammer on the transmit.

I agree that the 26C92 is the better choice in UARTs. However, the new board I'm working on is tiny.... 2.5"x3.8" and is pretty packed with PLCC chips and such. Having the 28-pin PLCC 2691 is a big plus for space. Looking at the datasheet, I'm thinking I should make 6MHz on the CPU clock, but I'll find out.

The new board will be using one of the FTDI DB9-USB interfaces and this device has built in buffers as well, 128 bytes on transmit and 256 bytes on receive. Once you have the driver installed on the host PC, it does additional buffering (Win7 has a 4KB buffer on these). All in all, still a work in progress, but I'm making progress and having some fun too.

Thanks again for the feedback, much appreciated.

_________________
Regards, KM
https://github.com/floobydust


PostPosted: Thu Jan 30, 2014 1:50 pm 
Joined: Tue Mar 05, 2013 4:31 am
Posts: 1378
I've gone thru the timings again.... making the changes saved 3 bytes of code space, took 1 clock cycle off everything via the BRK/IRQ determination and 4 clock cycles off XMIT and CTS paths. The same full transmit scenario as before now takes 30.5% of the CPU time, a 1% savings. New clock cycle counts:

6551 IRQ driven/buffered receive/transmit full-duplex routines
- 128 byte circular buffers for xmt and rcv
- character in/out routines manage buffer input/output and turn on xmit IRQ
- IRQ routines handle the chip's data I/O and buffers inverse to character in/out, and turn xmt off

7 clock cycles IRQ response latency
47 clock cycles for ROM IRQ pre-processing and post-processing including both indirect JMPs
note: 54 clock cycles overhead per above added to clock cycle counts below

ISR decision path:
Shortest time thru IRQ vector (not 6551 as cause) = add 7 clock cycles (61 total for non 6551 IRQ)
Detect RCV routine branch = add 15 clock cycles if taken
Detect XMT routine branch = add 19 clock cycles if taken
Default to CTS Error (fall out, set flag) = add 23 clock cycles (77 total for CTS error)

RCVCHR routine: LO/HI = buffer wraparound = No/Yes
Clock cycles = 41/42 for no XMT bit on (110/111 for no xmit)
Clock cycles = 40/41 drop to XMTCHR routine (109/110 for drop to xmit)

XMTCHR routine: LO/HI = buffer wraparound = No/Yes
Clock cycles = 32/33 if xmit stays on (105/106 if XMIT left on)
Clock cycles = 44/45 if xmit to be turned off (117/118 if XMIT gets turned off)
Clock cycles = 19 if no data left, turn off xmit and exit (92 total)

Character routines:
CHIN - 47/48 clock cycles to get character from buffer
- input loop is 6 clock cycles per then add 47/48 per above

CHOUT - 54/55 clock cycles if XMIT already on, 67/68 if XMIT needs to be turned on
- output loop is 6 clock cycles per then add 54/55 or 67/68 as per above
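
As a cross-check of the totals in this list, the path sums and the resulting CPU load can be recomputed in Python. This is a sketch with hypothetical names; it assumes 1920 characters per second at 19.2K bps, 15 buffer wraps per second, and the unchanged CHOUT cost of 54 cycles per character.

```python
LATENCY = 7   # IRQ response latency in clock cycles


def path_total(rom_overhead, decision, routine):
    """Clock cycles for one complete pass through the IRQ service path."""
    return LATENCY + rom_overhead + decision + routine


def sustained_load_percent(irq_cycles, chout_cycles=54, chars=1920,
                           wraps=15, clock_hz=1_000_000):
    """CPU load for sustained transmit: IRQ path plus CHOUT for every character."""
    total = chars * irq_cycles + wraps + chars * chout_cycles + wraps
    return 100.0 * total / clock_hz


before = path_total(48, 23, 32)   # original XMTCHR path, XMIT left on
after = path_total(47, 19, 32)    # after the BRK/IRQ and BIT changes

load_before = round(sustained_load_percent(before), 1)
load_after = round(sustained_load_percent(after), 1)
print(before, after, load_before, load_after)
```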

Just goes to show... always a little room for improvement ;-) Thanks.

_________________
Regards, KM
https://github.com/floobydust


PostPosted: Thu Jan 30, 2014 6:06 pm 
Joined: Thu May 28, 2009 9:46 pm
Posts: 8236
Location: Midwestern USA
GARTHWILSON wrote:
BigDumbDinosaur wrote:
Ironically, although the 65C816 has a slightly longer interrupt latency than the 'C02, it can plow through this sort of code even faster in native mode. For one thing, it has a dedicated IRQ vector, avoiding the time-consuming stack sniff needed with the 'C02 to figure out if the interrupt was an IRQ or a BRK.

However the '02, since it probably won't be running a preemptive multitasking OS, probably won't have much use for BRK. I haven't used BRK since I was in school in '82, so I never have the ISR check for it. :D It's usually unnecessary overhead.

If your system had a resident M/L monitor you'd probably want to implement a BRK handler. Otherwise, as you note, BRK's only likely use would be to access a kernel API via software interrupts (which is best done with the '816 by use of the COP instruction). I have a full BRK handler implemented in POC's firmware, but it costs nothing in interrupt processing due to the separate hardware vector provided with the '816 in native mode.
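
As a rough illustration of the "stack sniff" mentioned above: on the 65C02, IRQ and BRK share one vector, so the handler has to look at the status byte that was pushed on the stack and test the B flag (bit 4). The Python below only models that test; the function name is made up for the example.

```python
B_FLAG = 0x10  # bit 4 of the status byte as pushed on the stack


def was_brk(stacked_status):
    """True if the stacked status byte shows the B flag set (BRK, not a hardware IRQ)."""
    return bool(stacked_status & B_FLAG)


print(was_brk(0x34), was_brk(0x24))
```

On a real 'C02 this costs a few instructions in every pass through the ISR, which is the overhead the '816's separate native-mode vectors avoid.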

Quote:
Quote:
I doubt that increasing the buffer sizes to 256 bytes will have significant effect.

Probably not, but it would depend on how big the records you're receiving are and when they're needed and how long it takes to process them compared to how long it takes them to come in. If a record is more than 128 bytes and suddenly you're ready for it and it's not all there because you stopped reception because the buffer was full, now there will be an unnecessary delay while the rest comes in. Again though, that might be a rare situation.

I have jacked around with buffer sizes on the POC unit in an effort to see how much effect they might have. My general observation is that as Ø2 is increased buffer size becomes less critical. On the primitive side of the system, low Ø2 rates demand more buffer capacity to reduce the likelihood of RxD buffer overflows (dropped bytes) or TxD underflows (frequent blocking).

It comes down to how much RAM you are willing to dedicate to buffering. My POC unit's BIOS is currently running 64 byte buffers, which is practical because of the 10 MHz Ø2 rate and the use of the 26C92's FIFOs. I doubt that such small buffers would be practical with the 6551, except in situations where the data rate is low.
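
To put rough numbers on the buffer-size question, here is a small sketch of how long continuous reception takes to fill a buffer that nobody is draining. The 19200 bps rate and 10 bits per character are assumptions for the example, not figures from the POC unit.

```python
def fill_time_ms(buffer_bytes, baud, bits_per_char=10):
    """Milliseconds of continuous reception needed to fill an undrained buffer."""
    chars_per_second = baud / bits_per_char
    return 1000.0 * buffer_bytes / chars_per_second


for size in (16, 64, 128, 256):
    print(size, "bytes:", round(fill_time_ms(size, 19200), 1), "ms")
```

At that rate a 128 byte buffer gives the foreground about 67 ms of slack before reception has to stop, and doubling the buffer doubles the slack.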

Quote:
Are there multiple manufacturers of the 26c92?

I don't know for certain. The 26C92 was developed by Philips and is now manufactured by NXP, which was formerly the semiconductor division of Philips. As big as Philips is, I doubt that anyone has asked for a second source.

Incidentally, NXP has a variety of multiple channel UARTs, some of which are used in automotive applications for exchanging data between the main ECM and other subsystems. A four-channel version of the 26C92 is the 28C94, which will be part of my in-development POC V2 unit.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Thu Jan 30, 2014 6:49 pm 
Joined: Thu May 28, 2009 9:46 pm
Posts: 8236
Location: Midwestern USA
floobydust wrote:
Eliminating the second read of the status register is a good suggestion and using the BIT instruction instead as it doesn't alter the A reg contents, plus the 3 bytes code savings and the clock cycles.

Oh...forgot this when I was replying earlier. The SEI - CLI sequences in your foreground functions (e.g., CHIN) aren't necessary as long as you don't change the buffer pointer until after the fetch (receive) or store (transmit) operations. Keep in mind that the MPU can't be in two places at the same time, so there will not be contention as long as things happen in the correct order.

As an example, here's the foreground code I implemented for fetching a byte from the console serial port's RxD buffer:

Code:
;================================================================================
;
;getcha: READ FROM CONSOLE PORT
;
;   ——————————————————————————————————————————
;   Preparatory Ops: NONE
;
;   Returned Values: .A: datum or entry value
;                    .B: entry value
;                    .X: entry value
;                    .Y: entry value
;
;   MPU Flags: NVmxDIZC
;              ||||||||
;              |||||||+———> 0: datum available
;              |||||||      1: no datum
;              |||++++————> entry values
;              ||+————————> 1
;              ++—————————> entry values
;
;   Example: JSR GETCHA
;            BCS NODATUM
;            STA DATUM
;   ——————————————————————————————————————————
;
getcha   shorta                ;working with bytes
         sec                   ;assume no datum
         php                   ;save register widths
         longx                 ;16 bit index saves
         phx                   ;save
         phy                   ;save
;
;———————————————————————————————
.yreg    =1
.xreg    =.yreg+s_word
.sreg    =.xreg+s_word
;———————————————————————————————
;
         shortx                ;8 bit registers
         ldx e32rixga          ;"get" index
         cpx e32rixpa          ;"put" index
         beq .0000010          ;buffer empty
;
         ldy e32bufra,x        ;get datum
         txa                   ;old "get" index
         inc a                 ;adjust
         and #e32xmask         ;deal with wrap
         sta e32rixga          ;new "get" index
         lda .sreg,s           ;get SR stack copy
         and #sr_car_i         ;clear carry to...
         sta .sreg,s           ;indicate datum available
         tya                   ;copy datum
;
.0000010 longx                 ;16 bit index loads
         ply                   ;restore
         plx                   ;restore
         plp                   ;restore
         rts

Note that IRQs are not masked at any time. Also, since my buffer size is not a full page I mask part of the index to keep the latter's range within the size of the buffer. I can readily implement a new buffer size by simply defining a new size that is an even submultiple of 256 (that is, a power of two) and changing the value of E32XMASK to match. Side note: since the 65C816 is a 16 bit MPU, I could define buffers that are multiples of 256 bytes.
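
For readers who don't speak '816 assembly, the receive side can be modeled in a few lines of Python. This is only a sketch of the logic: the 64 byte size, the assumption that the mask equals the buffer size minus one, and the `isr_store` stand-in for the receive ISR (which isn't shown here) are all mine.

```python
BUF_SIZE = 64            # assumed size; must be a power of two
XMASK = BUF_SIZE - 1     # plays the role of E32XMASK (assumed to be size - 1)


class RxRing:
    def __init__(self):
        self.buf = [0] * BUF_SIZE
        self.get_ix = 0  # changed only by the foreground (like e32rixga)
        self.put_ix = 0  # changed only by the ISR (like e32rixpa)

    def isr_store(self, datum):
        """Hypothetical receive ISR: store the datum, then publish the put index."""
        nxt = (self.put_ix + 1) & XMASK
        if nxt == self.get_ix:
            return False             # buffer full, datum dropped
        self.buf[self.put_ix] = datum
        self.put_ix = nxt
        return True

    def getcha(self):
        """Return (carry, datum); carry is 1 when no datum is available."""
        if self.get_ix == self.put_ix:
            return 1, None           # buffer empty
        datum = self.buf[self.get_ix]              # fetch first...
        self.get_ix = (self.get_ix + 1) & XMASK    # ...then advance the index
        return 0, datum


ring = RxRing()
ring.isr_store(0x41)
print(ring.getcha(), ring.getcha())
```

Each index has exactly one writer, which is why the foreground never needs to mask IRQs.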

The code for storing a byte to the console port's TxD buffer is similar:

Code:
;================================================================================
;
;putcha: WRITE TO CONSOLE PORT
;
;   —————————————————————————————————————————————————————————
;   Preparatory Ops: .A: 8 bit datum
;
;   Returned Values: .A: entry value
;                    .B: entry value
;                    .X: entry value
;                    .Y: entry value
;
;   MPU Flags: NVmxDIZC
;              ||||||||
;              |||||||+———> 0
;              +++++++————> entry values
;
;   Example: LDA #$41
;            JSR PUTCHA
;
;   Notes: 1) This function will block if the buffer is full.
;   —————————————————————————————————————————————————————————
;
putcha   clc                   ;never an error
         php                   ;save MPU state
         longr
         pha                   ;save all
         phx
         phy
         cli                   ;IRQs must be enabled
         shortr                ;8 bit registers
         tay                   ;buffer datum
         ldx e32tixpa          ;buffer "put" index
         txa                   ;make a copy
         inc a                 ;point to next position
         and #e32xmask         ;deal with wrap
;
.0000010 cmp e32tixga          ;buffer "get" index
         bne .0000020          ;buffer not full
;
         wai                   ;waste some time &...
         bra .0000010          ;try again
;
.0000020 xba                   ;hold new index
         tya                   ;recover datum &...
         sta e32bufta,x        ;buffer it
         xba                   ;recover new index &...
         sta e32tixpa          ;set it
         bit e32txst           ;transmitter enabled?
         bpl .0000030          ;yes
;
         lda #e32txsta
         trb e32txst           ;clear disable flag &...
         lda #u92crtxe         ;enable...
         sta ccr_92a           ;transmitter
;
.0000030 longr
         ply                   ;clean up stack
         plx
         pla
         plp
         rts

Again, IRQs are enabled at all times.
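
The transmit side can be sketched the same way. This Python model is non-blocking: where the real routine would WAI and retry, it just returns False. The buffer size, the `isr_fetch` stand-in, and the idea that the ISR sets the disable flag when the buffer runs dry are assumptions for the example, not taken from the firmware.

```python
BUF_SIZE = 64
XMASK = BUF_SIZE - 1


class TxRing:
    def __init__(self):
        self.buf = [0] * BUF_SIZE
        self.put_ix = 0           # changed only by the foreground
        self.get_ix = 0           # changed only by the ISR
        self.tx_disabled = True   # models the flag tested with BIT
        self.enables = 0          # counts modeled "enable transmitter" commands

    def try_putcha(self, datum):
        nxt = (self.put_ix + 1) & XMASK
        if nxt == self.get_ix:
            return False          # full: the assembly version waits here
        self.buf[self.put_ix] = datum   # store the datum first...
        self.put_ix = nxt               # ...then publish the new index
        if self.tx_disabled:            # kick the transmitter only when needed
            self.tx_disabled = False
            self.enables += 1
        return True

    def isr_fetch(self):
        """Hypothetical transmit ISR: pull one byte, or shut the transmitter off."""
        if self.get_ix == self.put_ix:
            self.tx_disabled = True
            return None
        datum = self.buf[self.get_ix]
        self.get_ix = (self.get_ix + 1) & XMASK
        return datum
```

One consequence of the "next index equals get index" full test is that the usable capacity is one byte less than the buffer size.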

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Thu Jan 30, 2014 7:52 pm 
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8460
Location: Southern California
floobydust wrote:
BRK instruction, probably not as useful, but I recall my old VICMON cartridge allowed setting of breakpoints which will trigger the BRK instruction for debugging. Having the handler for BRK, you can zero out a byte of code and end up here, but agreed, very limited use.

If there's the breakpoints thing in an old system that's already in place and you want to use it, that might be a good reason to use BRK. It's not the only way to do the job though. On the first 10,000-line program I worked on (about 1986), one of the first things I did was to write a subroutine called "BREAK" which allowed you to view/edit all registers and RAM, then return to the program when you exited. Then instead of putting the 2-byte BRK instruction in the code where I wanted to stop and look around, I put in JSR BREAK. It was with windowed EPROM and the assembler and EPROM programmer were fast enough that there was no point in avoiding doing the whole EPROM like they had to do when PROMs actually blew fuses and were not erasable.

It seems I mostly do realtime projects now, meaning the hardware they control wouldn't work at all if you stopped the program to enter a BREAK routine to look around. One thing I have done is to have it output selected operation status by serial and keep going, so I can see it on the 'scope without bringing everything to a halt.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Fri Jan 31, 2014 3:21 am 
Joined: Tue Mar 05, 2013 4:31 am
Posts: 1378
Hi BDD,

Yes, I've wrestled with the SEI instruction in the CHIN/CHOUT routines.... I rely on the PLP to also perform the CLI so if I remove the SEI opcodes, I'll save two bytes. Somehow I'm thinking you might just be able to get a page pointer changed in one routine and get corrupted by the ISR, but the more I run through the scenario, it's unlikely. For buffer wraparound, I'm just using the N flag to reflect the high-order bit when you increment past $7F, but it locks you to a 128 byte buffer size. I do like your method of using a mask during the buffer management, clever dinosaur. I will make the other changes to the routines and do some additional testing.

Garth,
Yes, BRK is an old concept... but for now I'm going to keep it. I've managed to restructure and rewrite a large chunk of the SyMon code now, most routines have been modified, many have been replaced, some eliminated and all of the startup, I/O and interrupt code is new. The BRK handler is sorta nice to have, as SyMon is a good basic monitor plus a simple assembler/disassembler (supports 65C02 opcodes) and can view/edit text, search, copy, fill memory, registers, etc. I've used the BRK function a handful of times entering and running code locally on the board, mostly as a test and it works okay.

Still a long way to go on adding additional functions (memory compare, EEPROM page write module, timer support, parallel port functions, etc.) and getting UART support for the NXP chips. But I do appreciate the insight, ideas and recommendations from both of you. Thanks.

_________________
Regards, KM
https://github.com/floobydust


PostPosted: Fri Jan 31, 2014 5:03 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10837
Location: England
This is great analysis and feedback! Oddly enough, because of the systems I first got to know, I tend to think of the standard 6502 speed as 2MHz, which would reduce the overhead proportionally, but 19200 safely handled with a 1MHz clock (and a dumb ACIA) is a great result - it can only get better as you step up clock speed or UART capability, or indeed swap in an '816.
Cheers
Ed


PostPosted: Fri Jan 31, 2014 6:36 am 
Joined: Thu May 28, 2009 9:46 pm
Posts: 8236
Location: Midwestern USA
floobydust wrote:
Yes, I've wrestled with the SEI instruction in the CHIN/CHOUT routines.... I rely on the PLP to also perform the CLI so if I remove the SEI opcodes, I'll save two bytes. Somehow I'm thinking you might just be able to get a page pointer changed in one routine and get corrupted by the ISR, but the more I run through the scenario, it's unlikely.

As long as the buffer index change is an atomic operation (e.g., R-M-W instruction) or is deferred until after the buffer access has occurred, you will not have any contention between the foreground and ISR. If, for example, the UART interrupts after you write the TxD buffer but before you update the foreground side TxD ("put") index, the ISR will continue to see the buffer as it was before you did the write.
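
The interrupt-in-the-middle case described above can be played out in a toy Python model. The "ISR" here is just a function that reports how many bytes it thinks are pending, and it is called between the data write and the index update. Names and the 8 byte size are made up for the example.

```python
SIZE = 8
MASK = SIZE - 1

buf = [None] * SIZE
state = {"put": 0, "get": 0}


def pending():
    """What the ISR believes is waiting: distance from get index to put index."""
    return (state["put"] - state["get"]) & MASK


def put_with_interrupt(datum, isr):
    buf[state["put"]] = datum                   # step 1: write the datum
    seen_mid = isr()                            # an IRQ lands right here
    state["put"] = (state["put"] + 1) & MASK    # step 2: publish with one store
    seen_after = isr()
    return seen_mid, seen_after


print(put_with_interrupt(0x41, pending))   # prints (0, 1)
```

The mid-write interrupt still sees an empty buffer, so the half-finished write is invisible to it; the byte only "exists" for the ISR once the index has been stored.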

Quote:
For buffer wraparound, I'm just using the N flag to reflect the high-order bit when you increment past $7F, but it locks you to a 128 byte buffer size. I do like your method of using a mask during the buffer management, clever dinosaur.

Dunno about the clever part. I do sometimes manage to fool them there humanoids with my smarts routine, though.

The mask method does constrain the buffer size to a power of two, but it also allows you to use a really small buffer if your system can handle it. I've gone as low as 16 bytes and still had satisfactory operation. Of course, that's with a relatively high Ø2 rate and a DUART with FIFOs. I suspect performance with such a small buffer would be terrible with a 65C02 and 65C51 running at 1 MHz, especially on the receive side. Data overruns would be rampant.
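
A quick Python sketch comparing the two wrap methods discussed here. The mask helper checks the power-of-two constraint; the N-flag variant assumes the index is reset to zero when bit 7 comes up after the increment, which is my reading of the 128 byte scheme rather than the actual SyMon code.

```python
def index_mask(size):
    """AND mask for a ring of `size` bytes; size must be a power of two up to 256."""
    if size < 2 or size > 256 or size & (size - 1):
        raise ValueError("size must be a power of two between 2 and 256")
    return size - 1


def next_by_mask(ix, size):
    """Advance an index using the mask method."""
    return (ix + 1) & index_mask(size)


def next_by_n_flag(ix):
    """128-byte-only variant: bit 7 set after the increment means wrap to 0."""
    ix = (ix + 1) & 0xFF
    return 0 if ix & 0x80 else ix


same = all(next_by_mask(i, 128) == next_by_n_flag(i) for i in range(128))
print(same, index_mask(16), index_mask(64))
```

For a 128 byte buffer the two give identical results; the mask version just makes the size a one-line change.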

As far as the TxD buffer is concerned, when driving a UART without a FIFO, a bigger buffer is better. Otherwise, the foreground process writing to the buffer will frequently block awaiting buffer space.

Quote:
I will make the other changes to the routines and do some additional testing.

The code I posted, although written to take advantage of the 26C92, is readily adaptable to the 2691, since the latter device is logically one-half a 2692 (not 26C92). The 2691 has a four-deep receive FIFO and no transmit FIFO, the latter of which results in the same sort of transmission bottleneck that the 6551 experiences: one interrupt per transmitted character. If you correctly implement your RxD ISR code you can get some interrupt relief. The RxD part of my ISR will (I think; I'd have to compare status registers) work with the 2691, and should be able to process and buffer up to four incoming characters per IRQ. If you want to see it for some guidance please let me know.

I also have archived the code I wrote to work with the plain 2692, so you can see that if you want.
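
The "up to four characters per IRQ" idea can be sketched in Python as a drain loop. This is a model only: the deque stands in for the receive FIFO, the list for the RxD buffer, and the best-case interrupt rate assumes the FIFO is completely full every time the ISR runs, which a real system won't always manage.

```python
from collections import deque

FIFO_DEPTH = 4   # receive FIFO depth quoted above for the 2691


def rx_isr(fifo, ring):
    """Drain every character waiting in the FIFO during a single interrupt."""
    handled = 0
    while fifo:                      # models polling a "receiver ready" status bit
        ring.append(fifo.popleft())
        handled += 1
    return handled


def irqs_per_second(baud, chars_per_irq, bits_per_char=10):
    """Interrupt rate on a saturated line, assuming 10 bits per character."""
    return baud / bits_per_char / chars_per_irq


fifo = deque([0x48, 0x45, 0x4C, 0x4F])
ring = []
print(rx_isr(fifo, ring), irqs_per_second(19200, 1), irqs_per_second(19200, FIFO_DEPTH))
```

The fixed entry and exit cost of the ISR is paid once per interrupt rather than once per character, which is where the relief comes from.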

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Last edited by BigDumbDinosaur on Sat Feb 01, 2014 4:43 am, edited 1 time in total.
