6502.org Forum

All times are UTC




Posted: Tue Feb 11, 2014 1:13 am
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
As a rule, vintage 6502 computers don't map their I/O devices into zero page. It's almost a tradition that the decode hardware will put I/O at some higher address instead. :) But should new 65xx designs follow the rule, too? I started this thread to talk about how much faster I/O in zero-page can be, and how the pros clearly outweigh the cons for some new designs.

I have edited this post in light of subsequent discussions. Here are the main points to consider:
  • benefits of I/O in z-pg: speed
  • benefits of I/O in z-pg: smaller code size
  • possible impediment: users' unfamiliarity with the idea
  • possible impediment: consumption of zero-page addresses (balanced against fast I/O as a priority)
  • possible impediment: decode hardware has less freedom to ignore address lines -- you need full decoding (or close to it).


Code Size and Speed
The chart below shows various instructions commonly used for I/O, listing the size and speed. In the cases (shown in color) where a given mode or instruction is not available, a roughly-equivalent "next best" code sequence is evaluated. For BBS/BBR and their equivalents, the usual optional extra cycles apply (for branch-taken and branch-taken with a page crossing).
Attachment: IO in Zero-page comparisons.gif

The 33% speedup for byte-oriented I/O (eg LDA STA) is noteworthy. Even greater are the gains for BBS/BBR (about 55%) and SMB/RMB (100% :shock: ) -- bit-oriented instructions which, unfortunately, are available only on the 'C02. However, even the SMB/RMB-equivalent code (used by '816 and NMOS '02) shows a 20% speedup if the I/O is in zero-page.
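
A minimal sketch of the arithmetic behind the byte-oriented figure, in Python. The cycle counts are the usual 65C02 values for these instructions and are an assumption here (they are not read off the attached chart); `speedup_percent` is just an illustrative helper name. The bit-oriented rows of the chart depend on the substitute code sequences, so they are not reproduced.

```python
# Cycle counts per addressing mode (usual 65C02 figures; assumed, not taken from the chart)
CYCLES = {
    "LDA / STA": {"zp": 3, "abs": 4},
    "INC / DEC": {"zp": 5, "abs": 6},
    "TSB / TRB": {"zp": 5, "abs": 6},
}

def speedup_percent(slow_cycles, fast_cycles):
    """How much faster the shorter instruction runs, as a percentage."""
    return (slow_cycles / fast_cycles - 1.0) * 100.0

for name, c in CYCLES.items():
    pct = speedup_percent(c["abs"], c["zp"])
    print(f"{name}: abs {c['abs']}~, zp {c['zp']}~, speedup {pct:.0f}%")
```

Going from 4 cycles to 3 is where the 33% comes from; read-modify-write instructions go from 6 cycles to 5, which works out to 20%.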

IMO, these gains could in some cases make or break the viability of a heavily I/O bound application. Example: SPI or a software UART.

If I were being more rigorous, the equivalent routines would begin with PHA and end with PLA. But that'll often be unnecessary, as Garth mentions below.

On the '816, "Zero Page" is known as Direct Page, and is relocatable via the Direct Page register. It's worth remembering that, during I/O, this register must be set to zero (or whatever page is decoded for I/O). This is usually a non-issue but may be a nuisance if you're using D's esoteric capabilities elsewhere in your code.


Possible Impediments

In the thread that leads up to this one, Garth remarked that I/O mapped in Zero-pg is not normal for our applications. Certainly I/O in Zero-pg is uncommon -- and the pros & cons are worth reviewing! The rarity of I/O in Z-pg is partly explained by the high premium placed on Zero-page itself:

  • with I/O in Z-pg, it's more important to fully decode the I/O address since it's undesirable to have many images of each device.
  • in vintage microcomputers, free addresses in Z-pg are unobtainium. That's a regrettable reality.
  • Unfortunately, yesterday's reality engenders a mindset that some modern-day builders seem to accept without question -- namely, that nothing merits the sacrifice of Z-pg. But in the context of a new design, the tradeoff might be attractive -- or even compelling.

Later in this topic I have some schematics to, uh, address (!) the challenge of decoding. Without proper attention, decoding is an issue that could limit clock speed in a system designed for absolute maximum performance.

Finally, I think I/O in Z-pg is rare simply because it's not talked about enough. IOW, its potential gets overlooked -- and that's a shame. But it wouldn't happen if we examined the matter on a case-by-case basis. My own experience has been 100% positive, and even WDC maps the I/O of its W65C134S into Zero-page. As I say, this approach needs to be mentioned more often.

-- Jeff

An incidental point (also from the other thread):
GARTHWILSON wrote:
I guess they [BBRn, BBSn, RMBn and SMBn instructions] were in ZP because a few microcontrollers had I/O there and they had so little ROM that it was important to save one byte each time one of these instructions was encountered
Absolute (as opposed to Zero-pg) versions of BBS/BBR might be impossible -- or at least in conflict with existing limitations on the sequencing logic. With branch-taken and a page crossing, Absolute BBS and BBR would be eight-cycle instructions! But I agree that saving code bytes in ROM may have been a consideration. Another factor is throughput, since the Zero-pg versions are about 20% faster than the Absolute versions would've been.

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Mon Jun 30, 2014 4:36 am, edited 2 times in total.

Posted: Tue Feb 11, 2014 2:58 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8546
Location: Southern California
Even though I've said it would be nice to have SMB & friends, your substitute code was of course covering all (or most of, since it did not preserve P) the bases, but we usually try to approach the code in a way that needs less of that.  For example, the SPI bit-bang code I have on my website includes this for sending a byte:
Code:
SEND_BYT:                     ; Start with byte in A.  Slave must already be selected.
    PHA                       ; Put the input number on the stack because we need A for the TRB/TSB mask below.
        CLK_DN                ; Prepare for mode-0 transmit, and for high clock pulse with INC DEC.
        TSX                   ; Put the stack pointer in X for shifting below without bringing it back into the accum.
        LDA  #2               ; 2 is the value of the MOSI bit for TSB & TRB.
        FOR_Y  8, DOWN_TO, 0  ; 8 is the number of bits we will shift out and test in the loop.
            ASL  101,X        ; Shift the input number left, since transmission is msb-first.
            IF_C_CLR          ; The bit gets put in the carry flag.
                TRB  VIA3PB   ; If the bit was a 0,  clear the MOSI bit in VIA3PB,
            ELSE_             ; otherwise
                TSB  VIA3PB   ; (ie, the bit was a 1), set the MOSI bit in VIA3PB.
            END_IF
            CLK_HI_PULSE      ; Pulse the clock line.  The INC/DEC does not affect A.
        NEXT_Y                ; Decr the counter and see if we need to repeat the loop.
    PLA                       ; Remove the number from the stack.  (We don't need it anymore, but we need the stack cleaned up.)
    RTS
 ;------------------

(I have not written code to receive and transmit at the same time since there has been no need of that with the SPI devices I've used.)  Here the PHA and PLA are done only once for the 8 bits, and the mask is loaded into the accumulator only once.  At the expense of a ZP variable, slightly greater efficiency could be gained by shifting that variable instead of doing it on the stack with the ASL 101,X which is shown above.

Regarding the full address decoding and ZP loss penalty of putting I/O in ZP, my workbench computer would lose 76 bytes of ZP (30% of ZP) with full decoding, and the logic delays would be quite a lot more than the two NAND stages I use now (or that I could replace with a section of a 74AC139 at 8.5ns max prop delay total), and would reduce the maximum clock speed, unless I resorted to programmable logic.

Using a data stack (which works out particularly well for ZP on the 6502) does reduce the need for so many variables.  I wonder how the ZP memory maps might be different in the Apple II and Commodore if their kernels had taken advantage of this.  Before I started working this way, and in situations like PIC16 where I was not able, I had to use more variables, and since the RAM was in short supply, I would try to double up and have different routines use the same RAM addresses as long as they didn't need them at the same time; but you have to be super careful.  A data stack avoids the problems, and the variables cease to exist when they're no longer needed.  It's a type of automatic garbage collection.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Posted: Tue Feb 11, 2014 5:53 am
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
GARTHWILSON wrote:
Using a data stack [...] I wonder how the ZP memory maps might be different in the Apple II and Commodore if their kernels had taken advantage of this.
It's easy to imagine that, as time went on, new features got added piecemeal -- and the list of z-pg requirements just grew and grew. A unified organization such as a stack might've largely avoided that. IOW a new design needn't include the horrible congestion seen in old designs.

Quote:
full address decoding [...] would reduce the maximum clock speed, unless I resorted to programmable logic.
As I said, each case needs to be examined individually. To decode z-pg, what you need is a big, wide OR (or NOR) gate. Some systems have programmable logic, and a big, wide OR/NOR is no problem. Other systems -- those using slower memory, peripherals and clock -- have less demanding timing margins, and it's OK to replace that big gate with some SSI equivalent circuit. You'll still benefit from the reduced cycle count of I/O in z-pg. But I admit programmable logic may be the only answer if you're pushing for maximum clock rates.

One very tidy decoding solution is the 74ALS679 -- a configurable 13-input gate which has no trouble in plunking the 16 registers of a VIA, for instance, into z-pg for you. It can easily be configured as a big, wide NOR (but it won't achieve 8.5ns max prop delay). Another way to get a wide NOR (14-input) is with both sections of a 74F260 feeding into an 'ACT138 (only one output would be used). Or a 74FCT521 might serve as part of the scheme.
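
To illustrate why the wide gate matters, here is a small Python model, a sketch only. It assumes a hypothetical 16-byte device (a VIA, say) based at $0000, so the select is simply "A15..A4 all low". The names `select_full` and `select_sloppy` are made up for the illustration and don't correspond to any particular chip.

```python
def select_full(addr):
    """Wide NOR over A15..A4: true only when all twelve lines are low."""
    return ((addr >> 4) & 0xFFF) == 0

def select_sloppy(addr):
    """Hypothetical partial decode: only A15..A8 are examined, A7..A4 are ignored."""
    return ((addr >> 8) & 0xFF) == 0

full_hits = [a for a in range(0x10000) if select_full(a)]
sloppy_hits = [a for a in range(0x10000) if select_sloppy(a)]

print(len(full_hits))           # addresses claimed with full decoding
print(len(sloppy_hits) // 16)   # number of images of the device with partial decoding
```

With the full decode the device occupies just its own 16 addresses. With the sloppy version it claims every address in zero page, which is exactly the waste that full decoding is meant to avoid.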

Quote:
my workbench computer would lose 76 bytes of ZP
Well, nothing says all the I/O has to be mapped to z-pg. You could map some of the less critical devices elsewhere -- and, if the "elsewhere" location is judiciously chosen, there'll be zero added complexity in the decode circuitry. A 50:50 split between Page 0 and Page 2 works nicely, for example. The circuit is virtually identical to that required for putting it all in pg 0.

FWIW, my own experience is with my workbench computer -- the Kimklone. :D It has a Rockwell 'C02 running at 5 MHz, and the I/O decoding uses a 74HC679. There are three VIA's (48 bytes total) mapped into z-pg, and I have oodles of space left over since the only other thing in z-pg is the Forth data stack.

Quote:
your substitute code was of course covering all (or most of, since it did not preserve P) the bases, but we usually try to approach the code in a way that needs less of that.
I agree that if we're careful then we need "less of that." I had mixed feelings about including PHA & PLA in my code examples. Still, your SPI code snippet can run a lot faster if the bits being banged are mapped to z-page. (Edit: also expect further improvements in the posts to follow.)
Code:
SEND_BYT:
                              ; save 3~; don't PHA
        CLK_DN                ; save 3~; RMB z-pg replaces LDA #1 and TRB abs. (details in Garth's post below)
                              ; save 2~; don't TSX
                              ; save 2~; don't LDA #2
        FOR_Y  8, DOWN_TO, 0  ;
            ASL  A            ; save 5~; ASL abs,X is 7~ but ASL A is only 2~
            IF_C_CLR          ;
                RMB  VIA3PB   ; save 1~. RMB z-pg replaces TRB abs.
            ELSE_             ;
                SMB  VIA3PB   ; save 1~. SMB z-pg replaces TSB abs.
            END_IF
            CLK_HI_PULSE      ; save 2~. INC DEC are each 1~ faster for z-pg than for abs.
        NEXT_Y                ;
                              ; save 4~; don't PLA
    RTS                       ;
The speedup we see here is from shorter instructions, from not having to push/pop A, and from not having a bit mask held in A. I hope I'm correct in assuming CLK_HI_PULSE assembles an INC abs followed by DEC abs (or vice versa).

FWIW, it's possible to optimize away use of Y as a loop counter, but that's not pertinent to I/O in z-pg so I'll leave it alone. BTW I sure do like the structured macros!

cheers
Jeff

[Edit for clarity, and better understanding of the CLK_DN macro. Tweak code comments and emphasize the split-I/O-area option.]

Last edited by Dr Jefyll on Tue Feb 11, 2014 6:58 pm, edited 1 time in total.

Posted: Tue Feb 11, 2014 6:26 am
Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Dr Jefyll wrote:
... Absolute (as opposed to Zero-pg) versions of BBS/BBR might be impossible -- or at least in conflict with existing limitations on the sequencing logic. With branch-taken and a page crossing, Absolute BBS and BBR would be eight-cycle instructions! But I agree that saving code bytes in ROM may have been a consideration. Another factor is throughput, since the Zero-pg versions are about 20% faster than the Absolute versions would've been ...

The absolute versions would be four bytes long, wouldn't they? Is that a major issue with regard to the design of the internal state machine?

Dr Jefyll wrote:
Finally, I think I/O in Z-pg is rare simply because it's not talked about enough. IOW, its potential gets overlooked -- and that's a shame. But it wouldn't happen if we examine the matter on a case-by-case basis. BDD mentioned some reluctance of his own -- also some useful workarounds -- and that's fine. But my own experience has been 100% positive, and even WDC maps the I/O of its W65C134S into Zero-page. As I say, this approach needs to be mentioned more often...

Excellent point. The hardware complexity vs. software complexity balance can be tilted either way, depending on the design requirements, and your code comparison definitely caught my attention.


GARTHWILSON wrote:
... Using a data stack (which works out particularly well for ZP on the 6502) does reduce the need for so many variables. I wonder how the ZP memory maps might be different in the Apple II and Commodore if their kernels had taken advantage of this ...

Another excellent point. It wouldn't have been hard to profile zp use while running the monitor, DOS and BASIC interpreter, and I'm sure that doing so would have exposed some inefficient usage ... in my opinion, the firmware gobbled up far more than necessary, and could have benefited significantly from a liberal dose of data-stack mentality.

Mike


Posted: Tue Feb 11, 2014 6:38 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8546
Location: Southern California
Thanks for the info on the '679! That looks better than the '521 or '688, except, as you say, the disappointing speed.

The CLK_DN macro is:
Code:
CLK_DN: MACRO                 ; NOTE!  Accum is not preserved!
        LDA  #1
        TRB  VIA3PB
        ENDM
 ;------------------

so it does step on the accumulator's contents; but if I/O were in ZP, then it could be replaced with an RMB also, again freeing up the accumulator so it wouldn't need to be bracketed with PHA...PLA to make the rest work the way you show, with I/O in ZP for the SMB & RMB in the loop.

Quote:
FWIW, it's possible to optimize away use of Y as a loop counter, but that's not pertinent to I/O in z-pg so I'll leave it alone.

Any suggestions to optimize any of my code are always welcome.  I don't have any emotional need to be the origin of all the most-efficient code, but rather, if there's a way to further improve it, that would be the priority.  I have a fair amount on my website in different places, and if I can ever get the stacks primer done [Edit: It's up, at http://wilsonminesco.com/stacks/ ], it will have a ton, including much of what one would need to write a STC Forth.

Quote:
BTW I sure do like the structured macros!

Thanks.  I want to improve those on my website too, now having done two more PIC projects with them and definitely never wanting to go back to working without them!  A few times when I've tried to re-write someone else's posted code with the structure macros to make it more clear what's happening, I run into a difficulty untangling the spaghetti.  Although I think that being more structure-oriented would cause the programmer to take a different approach, I was trying to re-write it in a way that would result in the same machine code but have much clearer source code.  I do want to implement a couple of other things though, like multiple WHILEs between BEGIN and REPEAT, and a BEGIN...WHILE...UNTIL...THEN, and add more options in the FOR_X/Y...NEXT_X/Y.  In my piece of code posted above though, what you would write longhand in assembly is probably exactly the same that the macros would produce.  IOW, there's no penalty to using the macros—only benefits.

Quote:
The absolute versions would be four bytes long, wouldn't they? Is that a major issue with regard to the design of the internal state machine?

I think the issue is the limitation of the number of clocks without a huge increase in the size of the array.

Posted: Tue Feb 11, 2014 7:12 am
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
barrym95838 wrote:
in my opinion, the firmware gobbled up far more than necessary
I'm no authority on vintage 6502ery, but certainly that's my hunch as well.
Quote:
The absolute versions would be four bytes long, wouldn't they? Is that a major issue with regard to the design of the internal state machine?
Yes I guess BBS/BBR absolute would be four bytes, but that's incidental. As Garth says, it's the cycle count that's of concern. Somewhere we have a link to a 1970's magazine article that touches on this, although I can't find it just now.

During early development, the 6502 design underwent some drastic surgery to meet die area (ie, cost) objectives. IIRC, one of the things that got chopped was automatic pushing of the accumulator upon interrupt. Nice feature, but it would've demanded that the sequencing logic support eight-cycle events. Evidently that would've required that the entire PLA be a little longer. (Or wider?) Anyway, saving A is something that can also be done in software. The feature was dropped and the die area objectives were achieved.

Edit - here's that reference: This is the press clipping, and BigEd's lead post in this thread includes some details. Chuck Peddle is quoted as saying the 6502 team "tried an 8-state machine which would dump the accumulator as well as the program counter and status but since it added another 10 mils to the chip, we discarded it."

Last edited by Dr Jefyll on Tue Feb 11, 2014 4:34 pm, edited 1 time in total.

Posted: Tue Feb 11, 2014 8:21 am
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
GARTHWILSON wrote:
Thanks for the info on the '679! That looks better than the '521 or '688, except, as you say, the disappointing speed.
Unfortunately, there are other disappointments. AFAIK the '679 was never produced in any process faster than HC or ALS. The HC version seems to have disappeared. And nowadays Digikey and Mouser are asking more than $20 for an 'ALS679! :?

Quote:
Any suggestions to optimize any of my code are always welcome.
Ummm... If you really wanna max out the speed of that SPI routine, the final tweak might demand some of that spaghetti we were hoping to avoid. :roll: I'm referring to the IF ELSE ENDIF that's in the body of the loop. I suspect the ELSE translates to an unconditional branch, and in this case it's pure overhead -- the unconditional branch does no "work." Instead the fastest code will use a conditional branch (as part of an extra copy of the loop bottom).

But as for eliminating Y as a loop counter, there's no spaghetti required. Instead of Y, I'd use the actual data in the "shift register." That's the accumulator... and the carry flag -- nine bits! IOW, just SEC before entering the loop. At the bottom of the loop (after each shift), loop if Not Zero. The accumulator is guaranteed non-zero until all 8 actual data bits have been shifted. :D
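
Here is a minimal Python emulation of that sentinel-bit idea, a sketch only. It assumes the loop is entered with SEC followed by a rotate-left of the data, and that each pass ends with ASL and a branch-if-not-zero; `send_bits` is a made-up name, and the list it returns stands in for the bits that would go out on MOSI.

```python
def send_bits(byte):
    """Emulate SEC / ROL A, then ASL A + BNE per bit. Returns the bits msb-first."""
    carry = 1                              # SEC
    acc = ((byte << 1) | carry) & 0xFF     # ROL A: the sentinel 1 enters bit 0...
    carry = (byte >> 7) & 1                # ...and the data msb lands in carry
    bits = []
    while True:
        bits.append(carry)                 # here MOSI would be set or cleared, then clock pulsed
        carry = (acc >> 7) & 1             # ASL A: next data bit goes to carry
        acc = (acc << 1) & 0xFF
        if acc == 0:                       # BNE falls through once the sentinel is shifted out
            break
    return bits

print(send_bits(0xA5))
```

The sentinel guarantees the accumulator stays non-zero until exactly eight data bits have passed through carry, whatever the data byte is (including $00).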

Last edited by Dr Jefyll on Tue Feb 11, 2014 8:22 am, edited 1 time in total.

Posted: Tue Feb 11, 2014 8:22 am
Joined: Sun Nov 01, 2009 2:22 pm
Posts: 21
Location: United Kingdom
GARTHWILSON wrote:
Regarding the full address decoding and ZP loss penalty of putting I/O in ZP, my workbench computer would lose 76 bytes of ZP (30% of ZP) with full decoding,


How about optimising the use of zp I/O by stacking similar devices at the same address? Access to each device would be controlled by a zp control byte, which would actually drive the chip select through, for example, a latch: the latch drives /CS1 while the normal zp I/O logic asserts CS0 on the device. It seems to me you'd get enhanced zp I/O and reduced zp overhead for a 'marginal' increase in s/w complexity.


Posted: Tue Feb 11, 2014 8:41 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8546
Location: Southern California
APL wrote:
GARTHWILSON wrote:
Regarding the full address decoding and ZP loss penalty of putting I/O in ZP, my workbench computer would lose 76 bytes of ZP (30% of ZP) with full decoding,

How about optimising the use of zp I/O by stacking similar devices at the same address? Access to each device would be controlled by a zp control byte, which would actually drive the chip select through, for example, a latch: the latch drives /CS1 while the normal zp I/O logic asserts CS0 on the device. It seems to me you'd get enhanced zp I/O and reduced zp overhead for a 'marginal' increase in s/w complexity.

So would you have a latch with its own address, and you write to it the bits specifying which IC to address next?  Actually, come to think of it, the 6520 does a similar thing.  Suppose you had its base address at 0000.  Control register A then is at address 0001; and its bit 2 determines whether using address 0000 will access port A or its data-direction register.  The 6522 separates these into individual addresses, making it easier.

Posted: Tue Feb 11, 2014 9:02 am
Joined: Sun Nov 01, 2009 2:22 pm
Posts: 21
Location: United Kingdom
GARTHWILSON wrote:
So would you have a latch with its own address, and you write to it the bits specifying which IC to address next? Actually, come to think of it, the 6520 does a similar thing. Suppose you had its base address at 0000. Control register A then is at address 0001; and its bit 2 determines whether using address 0000 will access port A or its data-direction register. The 6522 separates these into individual addresses, making it easier.


Yes, that was what I was thinking. The latch would have its own zp address - to refine it further, when you read the latch address, you get the interrupt status of the devices in the bank. Thinking a little more, all your zero page I/O could go in the footprint of the largest device - for example 16 bytes for a 65c22. The zp overhead would then be sixteen bytes plus one for the control/interrupt register. Up to eight devices (VIA, ACIA, etc.) could share the one 17-byte zp footprint; of course the trade-off would be added code complexity.

Edit:

Dr Jefyll wrote:
is also accelerated, but "only" by 33%


Which might be a worthwhile price for ~33% acceleration.


Posted: Thu Feb 20, 2014 10:09 am
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
I wrote:
it's possible to optimize away use of Y as a loop counter, but that's not pertinent to I/O in z-pg so I'll leave it alone.
GARTHWILSON wrote:
Any suggestions to optimize any of my code are always welcome.
Alright, I'll apply my "no-loop-counter" speedup first -- and then the "I/O in zero-page" speedup after that optimization, not before. That'll make the percentage of speedup for I/O in zero-page seem more dramatic! :D (Joking aside, we know that a program that doesn't spend most of its time actually doing I/O won't accelerate as much, percentage-wise, as one that's heavily I/O bound.)

As the initial reference, let's begin with Garth's version. I have expanded the macros so all instructions are explicitly listed. Later I'll compare this with alternative versions. In all cases,
  • average execution time is calculated based on a 50:50 proportion of ones and zeros in each byte
  • branches are presumed not to involve page crossings
  • the RTS is not included in the timing. Presumably any speed-obsessed coder would in-line the SEND_BYT routine (at least when it's contained in a loop).
  • those versions (like Garth's) that trash X, Y and A have PHY and PLY added. This helps to level the playing field, as other versions inherently preserve one or more register(s).

My pencil'n'paper tally says the average execution time for Garth's version is 299 cycles per byte:
Code:
SEND_BYT: PHY           3 ;
          PHA           3 ;Push the input number as TopOfStack
          LDA  #1       2 ;value of the clk bit
          TRB  VIA3PB   6 ;ensure clk is low
          TSX           2 ;X=S so we can use X to address TOS for shifting
          LDA  #2       2 ;value of the MOSI bit
          LDY  #8       2 ;number of bits in the loop.
                          ;
Looptop:  ASL  101,X    7 ;Shift the input number left (ie: msb-first) into Carry.
          BCS  BitIs1  2/3;if bit in Carry is 1, branch to BitIs1
          TRB  VIA3PB   6 ;Bit was a 0. Clear the MOSI bit in VIA3PortB
          BRA  BitDone  3 ;bit is done
BitIs1:   TSB  VIA3PB   6 ;Bit was a 1. Set the MOSI bit in VIA3PortB
                          ;
BitDone:  INC  VIA3PB   6 ;set clk high
          DEC  VIA3PB   6 ;set clk low
          DEY           2 ;Decr the counter
          BNE  Looptop 2/3;if counter not 0 yet, branch back & repeat the loop.

          PLA           4 ;clean the input number off the stack
          PLY           4 ;
          RTS
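
For anyone who'd rather not redo the pencil'n'paper work, the tally can be sketched in Python. The per-instruction cycle counts are copied from the listing above; the 50:50 bit mix, the absence of page crossings and the exclusion of RTS are the assumptions already stated. The variable names are just for illustration.

```python
# Cycle counts copied from the listing (RTS excluded, no page crossings)
SETUP = 3 + 3 + 2 + 6 + 2 + 2 + 2      # PHY PHA LDA TRB TSX LDA LDY
TEARDOWN = 4 + 4                        # PLA PLY

def per_bit(bit):
    """Cycles for one pass of the loop, with the closing BNE taken."""
    if bit:
        mosi = 3 + 6                    # BCS taken, TSB
    else:
        mosi = 2 + 6 + 3                # BCS not taken, TRB, BRA
    return 7 + mosi + 6 + 6 + 2 + 3     # ASL, MOSI update, INC, DEC, DEY, BNE taken

average_bit = (per_bit(0) + per_bit(1)) / 2
# Eight passes; the final BNE falls through, costing 2 cycles instead of 3
total = SETUP + 8 * average_bit - 1 + TEARDOWN
print(total)
```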

The next version averages 272.5~/byte -- a 9.7% speedup compared to the original. Y is not used, and the "bottom" of the loop is duplicated -- ie, there are two instances of BNE Looptop and the preceding instructions.
Code:
SEND_BYT: PHA            3 ;Push the input number as TopOfStack
          LDA  #1        2 ;value of the clk bit
          TRB  VIA3PB    6 ;ensure clk is low
          TSX            2 ;X=S so we can use X to address TOS for shifting
          LDA  #2        2 ;value of the MOSI bit
          SEC            2 ;the only 1 that'll get shifted into TOS
          ROL  101,X     7 ;carry <- TOS <- 1. msb is first to go to C for testing.
                           ;
Looptop:  BCC  BitIs0   2/3;if C clr, branch fwd
          TSB  VIA3PB    6 ;bit was a 1. Set the MOSI bit in VIA3PortB
          INC  VIA3PB    6 ;Pulse the clock line.
          DEC  VIA3PB    6 ;
          ASL  101,X     7 ;carry <- TOS <- 0
          BNE  Looptop  3/2;If TOS<>0 br back to repeat the loop.
          PLA            4 ;clean the input number off the stack
          BRA  Exit      3 ;


BitIs0:   TRB  VIA3PB      ;otherwise Bit was a 0. Clear the MOSI bit in VIA3PortB
          INC  VIA3PB      ;Pulse the clock line.
          DEC  VIA3PB      ;
          ASL  101,X       ;carry <- TOS <- 0
          BNE  Looptop     ;If TOS<>0 br back to repeat the loop.
          PLA              ;clean the input number off the stack
Exit:     RTS              ;
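
The same kind of tally for this version, again as a Python sketch under the stated assumptions (50:50 ones and zeros, no page crossings, RTS not counted). The BitIs0 path has no cycle counts in the listing, so it is assumed here to cost the same per instruction as the matching lines in the BitIs1 path.

```python
SETUP = 3 + 2 + 6 + 2 + 2 + 2 + 7        # PHA LDA TRB TSX LDA SEC ROL
ONE_PATH = 2 + 6 + 6 + 6 + 7 + 3         # BCC not taken, TSB, INC, DEC, ASL, BNE taken
ZERO_PATH = 3 + 6 + 6 + 6 + 7 + 3        # BCC taken, TRB, INC, DEC, ASL, BNE taken

average_bit = (ONE_PATH + ZERO_PATH) / 2
loop = 8 * average_bit - 1                # the final BNE falls through (2~ instead of 3~)
# PLA always; BRA Exit only when the last bit sent was a 1, so half the time on average
exit_cost = 4 + 0.5 * 3

total = SETUP + loop + exit_cost
speedup = (299 / total - 1) * 100         # versus the 299~ reference version
print(total, round(speedup, 1))
```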


The "I/O in zero-page" version averages 189.5 ~/byte. This is 44% faster than the already-optimized version. And X and Y are both preserved.
Code:
SEND_BYT: RMB0 VIA3PB    5 ;ensure clk is low
          SEC            2 ;the only 1 that'll get shifted into Accumulator
          ROL  A         2 ;carry <- Acc <- 1. msb is first to go to C for testing.
                           ;
Looptop:  BCC  BitIs0   2/3;if C clr, branch fwd
          SMB1 VIA3PB    5 ;bit was a 1. Set the MOSI bit in VIA3PortB
          INC  VIA3PB    5 ;Pulse the clock line.
          DEC  VIA3PB    5 ;
          ASL  A         2 ;carry <- Acc <- 0
          BNE  Looptop  3/2;If Acc<>0 br back to repeat the loop.
          BRA  Exit      3 ;

BitIs0:   RMB1 VIA3PB      ;otherwise Bit was a 0. Clear the MOSI bit in VIA3PortB
          INC  VIA3PB      ;Pulse the clock line.
          DEC  VIA3PB      ;
          ASL  A           ;carry <- Acc <- 0
          BNE  Looptop     ;If Acc<>0 br back to repeat the loop.
Exit:     RTS              ;
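
And the tally for the zero-page version, sketched in Python with the same assumptions as before (50:50 bit mix, no page crossings, RTS excluded, BitIs0 path costed like the matching lines of the BitIs1 path). The 272.5 figure is the average quoted above for the absolute-addressed version.

```python
SETUP = 5 + 2 + 2                         # RMB0, SEC, ROL A
ONE_PATH = 2 + 5 + 5 + 5 + 2 + 3          # BCC not taken, SMB1, INC, DEC, ASL A, BNE taken
ZERO_PATH = 3 + 5 + 5 + 5 + 2 + 3         # BCC taken, RMB1, INC, DEC, ASL A, BNE taken

average_bit = (ONE_PATH + ZERO_PATH) / 2
loop = 8 * average_bit - 1                 # the final BNE falls through
exit_cost = 0.5 * 3                        # BRA Exit only when the last bit sent was a 1

total = SETUP + loop + exit_cost
speedup = (272.5 / total - 1) * 100        # versus the no-loop-counter absolute version
print(total, round(speedup))
```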


Corrections are welcome if there's a typo or other boo-boo anywhere. Further speed increases are possible, up to twice as fast as the original -- or more. :shock: I can share that code if anyone's interested. With that version, I/O in zero-page is definitely helpful, but the speedup is mostly due to a table-based approach. Each nibble of data generates a 16-way branch, and each branch destination outputs 4 bits without pausing to take a breath! There's also an assumption that all bits of VIA3 PortB -- including the six bits not used for SPI -- may be read, held in a register, then written back later. That's problematic if an interrupt might modify any of those six bits, as the changes would be lost. But if interrupts don't touch those bits, or if the bits are used as inputs, then you're good to go! :D

cheers
Jeff
ps- the lead post has been revised somewhat -- for clarity, and for an '816 and NMOS '02 perspective on I/O in zero-page.

Posted: Thu Feb 20, 2014 8:33 pm
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8546
Location: Southern California
I like your different way to know when to stop looping.  Just replace the BRA Exit with RTS.  The JSR/RTS are needed, as this gets called from many, many places in an application.

I'm sure I will never be putting I/O in ZP myself.  If I ever build another 'c02 computer (rather than '816), I'm more likely to use your single-cycle 5-bit output port using undocumented NOPs.

Posted: Fri Feb 21, 2014 12:02 am
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
GARTHWILSON wrote:
I'm sure I will never be putting I/O in ZP myself.
If you're pushing for maximum CPU clock speed and can't use programmable logic for decoding, that makes sense. As I've said all along, each design needs to be considered individually -- which seems an obvious thing to underscore, but I/O in zero-page oughtn't to be overlooked merely because it happens to be uncommon! There are some who seem to flatly dismiss the idea as a non-starter. My own experience and WDC's use of I/O in zero-page (W65C134S) refute this.

Quote:
The JSR/RTS are needed, as this gets called from many, many places in an application.
Sure, I understand. But applications that seriously need the bit-banged SPI to be as fast as possible can gain a few percent by having a dedicated copy of the routine in-line where the "hot spot" occurs. Out of the many, many places in an application that you need to transmit a byte, probably only one is critical -- and it's a block move. (If the SPI is talking to a temperature sensor, you won't encounter this. But talking to an SD card or EEPROM you probably will.)

In the context of a max-performance block move, SEND_BYT would typically be embedded in a loop, with LDA (SrcBuffer),Y preceding it and INY / BNE following it -- all of that together would be the hot spot. The bracketing code adds about 10.5 cycles, which tends to "dilute" the 44% speedup I was crowing about. :roll: But, even so, the speedup is still 41%. IMO, speeding up a program's hot spot by 41% amply supports my assertion that this can make or break the viability of an application. (And it's not all about SPI. For another example, consider a bit-banged software UART that fails to achieve the requisite baud rate.)
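The "dilution" effect is just a fixed overhead being added to both the slow and the fast version. Here is a minimal sketch of that arithmetic. The 10.5-cycle overhead comes from the post above; the two per-byte cycle counts are placeholders chosen only so that the bare routines differ by 44%, and they are not the actual counts from this thread.

```python
def speedup(slow_cycles, fast_cycles, overhead_cycles=0.0):
    """Fractional speedup when the same fixed overhead is added to both versions."""
    return (slow_cycles + overhead_cycles) / (fast_cycles + overhead_cycles) - 1.0


# Placeholder cycle counts (hypothetical, NOT taken from the thread).
FAST = 150.0
SLOW = 216.0           # 216 / 150 = 1.44, i.e. a 44% speedup for the bare routine
LOOP_OVERHEAD = 10.5   # LDA (SrcBuffer),Y + INY + BNE, figure quoted in the post

base = speedup(SLOW, FAST)
diluted = speedup(SLOW, FAST, LOOP_OVERHEAD)
print(f"bare routine : {base:.1%}")
print(f"inside loop  : {diluted:.1%}")
```

With these placeholder numbers the speedup drops by about three percentage points, the same order as the 44% to 41% change described. The exact result depends on the real cycle counts of the two routines.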

Keep the idea in your bag of tricks. Someday you might surprise yourself and use it! :D

-- Jeff



PostPosted: Fri Feb 21, 2014 3:28 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8546
Location: Southern California
I re-wrote my original using your different criterion for knowing when to quit looping. It avoids using Y, but it also avoids duplicating the end of the loop (to keep the program smaller) and keeps the non-ZP I/O. I got a reduction of 5 cycles from my original, if your 299 number included the JSR:
Code:
SEND_BYT:                    ; bytes  cycles   tot cy
    PHA                      ;  1       3         3
        CLK_DN ;(leaves A=1) ;  5       8         8
        TSX                  ;  1       2         2
        INA                  ;  1       2         2
        SEC                  ;  1       2         2
        ROL  101,X           ;  3       7         7
        BEGIN                ;  0
            IF_CARRY_SET     ;  2      2/3       20, for 4 1's and 4 0's
               TSB  VIA3PB   ;  3       6        24          "
            ELSE_            ;  2       3        12          "
               TRB  VIA3PB   ;  3       6        24          "
            END_IF           ;  0
            CLK_HI_PULSE     ;  6      12        96
            ASL  101,X       ;  3       7        56
        UNTIL_ZERO           ;  2      3/2       23
    PLA                      ;  1       3         3
    RTS                      ;  1       6         6
 ;--------------------                          ___
                                                294 (including the JSR)

It seems pretty clear that all the tactics must be used together to make much of a dent in execution time.  Using your single-cycle output port though, 80 clocks could be removed from just the SPI clock-pulsing, 48 more from setting the MOSI line, and, not needing the mask in A for TSB/TRB, the shifting/rotating can be done in A, saving 45 more.  That takes it down to 121 cycles, if I counted right.  That's even without duplicating the end of the loop.
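For anyone who wants to check the arithmetic, here is a small sketch that re-adds the "tot cy" column of the listing above and applies the three savings quoted in the paragraph. The per-row totals are copied from the listing. The assumption that IF_CARRY_SET and UNTIL_ZERO each assemble to a single conditional branch is mine, inferred from the 2/3 and 3/2 cycle figures.

```python
# (label, total cycles) copied from the "tot cy" column of the listing
ROWS = [
    ("PHA", 3), ("CLK_DN", 8), ("TSX", 2), ("INA", 2), ("SEC", 2),
    ("ROL 101,X", 7),
    ("IF_CARRY_SET", 20), ("TSB VIA3PB", 24),
    ("ELSE_", 12), ("TRB VIA3PB", 24),
    ("CLK_HI_PULSE", 96), ("ASL 101,X", 56), ("UNTIL_ZERO", 23),
    ("PLA", 3), ("RTS", 6),
]
JSR_CYCLES = 6  # JSR absolute

# Assumed breakdown of the two branch rows (single branch instruction each):
if_branch = 4 * 2 + 4 * 3   # not taken for four 1 bits, taken for four 0 bits
until_branch = 7 * 3 + 2    # loops back 7 times, falls through once

routine = sum(cycles for _, cycles in ROWS)
total = routine + JSR_CYCLES

# Savings quoted for the single-cycle output port
projected = total - 80 - 48 - 45

print("branch rows:", if_branch, until_branch)
print("routine:", routine, " with JSR:", total, " projected:", projected)
```

The column sums to 288, which gives 294 with the JSR, and subtracting the three quoted savings gives 121, so the count in the post checks out.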



PostPosted: Fri Feb 21, 2014 6:47 am 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Quote:
got a reduction of 5 cycles from my original, if your 299 number included the JSR
No, not 5 -- each of us is including different stuff in the calculation. I think eliminating Y as the loop counter saves 9 cycles. We lose 8 DEY's (and save 16~) in the loop, but there are 9 shift operations, not 8, so the saving is diminished by 7~. (It'd only diminish 2~ if A weren't tied up and we could do the shifts there.)
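The 9-cycle figure can be reproduced with a few lines of arithmetic. This is a sketch using the cycle counts that appear in the thread (2 cycles for DEY, 7 for a shift on 101,X) plus the standard 2 cycles for an accumulator shift. The variable names are made up for the example.

```python
DEY_CYCLES = 2        # DEY, implied addressing
SHIFT_MEM_CYCLES = 7  # ROL/ASL 101,X as counted in the listing
SHIFT_A_CYCLES = 2    # ROL A / ASL A

saved_by_dropping_dey = 8 * DEY_CYCLES   # eight DEYs removed from the loop
extra_shifts = 9 - 8                     # the sentinel-bit method needs one more shift

net_saving_in_memory = saved_by_dropping_dey - extra_shifts * SHIFT_MEM_CYCLES
net_saving_in_a = saved_by_dropping_dey - extra_shifts * SHIFT_A_CYCLES

print("shifts in memory:", net_saving_in_memory, "cycles saved")
print("shifts in A     :", net_saving_in_a, "cycles saved")
```

So the net saving is 9 cycles with the shifts done on the stack, and it would be 14 if the accumulator were free for shifting.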
Quote:
It seems pretty clear that all the tactics must be used together to make much of a dent in execution time.
Putting it that way isn't wrong, but seems to miss the point that I/O in zero-page is as beneficial as all the other tactics combined (except the fast but bloated 16-way-JMP version I mentioned).

However, I will admit wishing I'd chosen a slightly less sensational title for this topic. A speedup like 41% is "significant," or even "major," but not exactly "huge"! :oops:

Quote:
Using your single-cycle output port [...] takes it down to 121 cycles
Hmmm... could your SPI input routine benefit from special hardware, too? How does your machine presently read in MISO? Is it on VIA3PB?


