6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Thu Nov 14, 2024 6:34 am

All times are UTC




Post new topic Reply to topic  [ 35 posts ]  Go to page 1, 2, 3  Next
Author Message
 Post subject:
PostPosted: Sat Aug 20, 2005 11:46 pm 
Offline
User avatar

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
A UM* which shifts the product right (rather than left) does not have this bug, is slightly shorter (in terms of space), and is faster for most cases.

Code:
   LDA #0   ; in some implementations TYA can be used since NEXT leaves Y=$00
   STA N
   LDY #16
   LSR 3,X
   ROR 2,X
L1 BCC L2
   CLC
   PHA
   LDA N
   ADC 0,X
   STA N
   PLA
   ADC 1,X
L2 ROR
   ROR N
   ROR 3,X
   ROR 2,X
   DEY
   BNE L1
   STA 1,X
   LDA N
   STA 0,X
   JMP NEXT


At the expense of 3 (or 4) cycles, a couple of bytes can be saved by changing the BNE L1 to a BPL L1 and replacing LSR 3,X ROR 2,X with a BNE to the ROR 3,X

The PHA and PLA could be replaced by the slightly faster STA N+1 and LDA N+1.

The CLC can be eliminated, which will make UM* longer, but faster in most cases.

Code:
   LDA 3,X
   EOR #$FF
   LSR
   STA 3,X
   LDA 2,X
   EOR #$FF
   ROR
   STA 2,X
   LDA #0   ; in some implementations TYA can be used since NEXT leaves Y=$00
   STA N
   LDY #16
L1 BCS L2
   PHA      ; or use STA N+1 ...
   LDA N
   ADC 0,X
   STA N
   PLA      ; ... and LDA N+1
   ADC 1,X
   ROR
   ROR N
   ROR 3,X
   ROR 2,X
   DEY
   BNE L1
   STA 1,X
   LDA N
   STA 0,X
   JMP NEXT
L2 LSR
   ROR N
   ROR 3,X
   ROR 2,X
   DEY
   BNE L1
   STA 1,X
   LDA N
   STA 0,X
   JMP NEXT


Top
 Profile  
Reply with quote  
PostPosted: Sat Jun 09, 2012 7:15 am 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
U* Speedup
U* attracted my attention when whartung was testing NUMBER. Later I took a close look at dclxvi's multipy routines above, and I noticed a significant optimization used there. U* involves four bytes of data getting repeatedly shifted. But dclxvi keeps only three bytes in memory, and the remaining byte is held in the Accumulator (which of course can be shifted more quickly). The 4~ speedup nets a total saving of 64~ when all 16 iterations of the loop are considered. I like it!

My own U* (below) copies that technique and adds a trick of its own. Shifting still uses a combination of memory and the accumulator. But instead of four bytes of data all getting shifted 16 times, there's a three-byte subset that gets shifted 8 times and then a different three-byte subset shifted 8 times.
  • the first 8 shifts are A -> N -> 2,X
  • the final 8 shifts are A -> N -> 3,X
This is illustrated visually in the following post; see the third of the three animations.

( Edit: In 2021 I did an '816 version of the modified-shift multiply and posted it in this thread. It accepts two 32-bit operands and produces a 64-bit result.)

Execution time improves by about 80 cycles -- and memory usage swells by 19 bytes.
Code:
;
;         U*
;
;         based on Bruce Clark's UM*
;         http://forum.6502.org/viewtopic.php?p=4389#p4389
;         Shifting strategy modified for speed - JWL

L386      .BYTE $82,'U',$AA
          .WORD L365     ; link to CMOVE
USTAR     .WORD *+2
          CLC        ;it's OK to omit the CLC if NEXT leaves C clear. *Implementation Dependent*
          LDA 0,X    ;Copy TOS value to N+2,N+3. To eliminate CLC inside the loop, the
          SBC #0     ;value at N+2,N+3 is reduced by 1 in advance. Note: in this U*
          STA N+2    ;special accommodation is mandatory for the case of TOS = 0.
          LDA 1,X
          SBC #0
          BCC UST_Z  ;TOS = 0? Mandatory special treatment for this
          STA N+3
          TYA        ;Set A=0. Assumes NEXT has left Y=0. *Implementation Dependent*
          STA N      ;16 bits of zero in A, N
                     ; Note:    First 8 shifts are  A -> N -> 2,X
                     ;          Final 8 shifts are  A -> N -> 3,X
          STX N+4    ;tested later for exit from outer loop
          DEX        ;bias applied to X
          DEX
UST_OUTLP LDY #8     ;count for inner loop
          LSR 4,X    ;think "2,x" then later "3,X"
UST_INLP  BCC UST_NOADD
          STA N+1    ;Save time, don't CLC. Value in N+2,N+3 is reduced by 1 to allow this
          LDA N
          ADC N+2
          STA N
          LDA N+1
          ADC N+3
UST_NOADD ROR A      ;shift
          ROR N
          ROR 4,X    ;think "2,x" then later "3,X"
          DEY
          BNE UST_INLP  ;go back for 1 more shift?
          INX
          CPX N+4
          BNE UST_OUTLP ;go back for 8 more shifts?
          STA 1,X       ;ms byte of hi-word of result
          LDA N
          STA 0,X       ;ls byte of hi-word of result
UST_EXIT  JMP NEXT
UST_Z     STY 2,X       ;Assumes NEXT has left Y=0. *Implementation Dependent*
          STY 3,X
          BCC UST_EXIT

dclxvi wrote:
The PHA and PLA could be replaced by the slightly faster STA N+1 and LDA N+1.

The CLC can be eliminated, which will make UM* longer, but faster in most cases.
I've used these ideas as well, with a shorter solution to the CLC business.


As a test, I wrote and ran a looping routine that multiplies every possible combination of operands -- IOW it does 2^32 multiplications. Other code in the loop uses addition to separately keep track of what the multiplication result ought to be.

The test routine revealed some erroneous results, which I fixed by adding the BCC UST_Z etc. This causes an early exit for cases in which TOS=0. Note that this is not just a speedup tactic! The trick of reducing the value at N+2,N+3 by 1 in advance fails when the value is "reduced" from 0 to $FFFF.

-- Jeff

Edit: Clarify credit for dclxvi. Also misc. tweaks. 2021: add link to other thread.

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Thu Oct 28, 2021 1:52 pm, edited 2 times in total.

Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 25, 2012 3:53 pm 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
I wanted to follow up on this topic, and opted for a picture rather than 1,000 words. Or would that be 1024 words..?? :D Anyway, for clarity (and as a tutorial in case anyone's not 100% up to speed) here's an animation of the three multiplication routines mentioned earlier in this thread:.
Attachment:
UStar animation (c)Jeff Laughton.gif
UStar animation (c)Jeff Laughton.gif [ 147.78 KiB | Viewed 54999 times ]
All three routines operate by testing one bit at a time and, if the bit is a 1, performing an addition. The values involved are:
  • 16-bit Multiplicand "a" (the one that remains static)
  • 16-bit Multiplicand "b" (the one that gets shifted and tested)
  • the 32-bit result -- the Product

I have labelled the locations (eg: 0,X; eg: A) where the six bytes of data are held. For example: Garth posted the FIG multiply routine -- and the animation's upper panel is your visual reference when reading that piece of code. Likewise the animation's middle panel illustrates the Right-shifting algorithm dclxvi posted (the basic version), and the bottom panel hopefully clarifies my "Modified Right-shift" approach.

(So, if there are two 16-bit and one 32-bit values involved, why do the diagrams show 6 bytes of storage rather than 8? Folks familiar with this sort of code will recognize an ubiquitous optimization. The 32-bit result begins as a 16-bit result, gradually expanding as the routine executes. At the same time the bit-tested multiplicand shrinks -- it can gradually be discarded -- as the routine executes. The upshot is that a 16-bit portion of these two values can cohabitate in the same storage! This boosts speed. Both values can be shifted at once, and they still maintain their individual functions. BTW if you find all this shifting bewildering, it'll help to think of how the columns need to be skewed when you do multiplication with a pencil and paper. 'Nuff said -- please refer to the code!)

The other day I ran these routines on actual 65xx hardware, bracketed with code snippets to read a 6522 timer and derive the execution times. The graph below presents these figures; also the memory requirements. Not surprisingly, there's a clear tradeoff between memory usage and execution speed. The graph lists five versions of Forth's U* (unsigned multiply):
  • the FIG code Garth posted (bug-fix included)
  • the Right-Shift version dclxvi posted
  • the modified Right-Shift version I posted
  • another modified Right-Shift version (not posted) with the inner loop partially unrolled (2 bits at a time, 4 iterations)
  • another modified Right-Shift version (not posted) with no inner loop (inline code for all 8 bits)
Attachment:
UStar bargraph.gif
UStar bargraph.gif [ 13.66 KiB | Viewed 54999 times ]
Each code version has its execution time listed for three multiplication tests: 1 times 1, 5555h times 5555h, and FFFFh times FFFFh. Regarding execution time, all versions except FIG respond to the number of ones in multiplicand "b," and multiplicand "a" is irrelevant. The Right-Shift version takes 21~ extra for each additional one, and the modified Right-Shift versions takes 17~ extra (15~ extra for the fully-unrolled version, which uses Y in place of N+1). The FIG version takes 33~ extra for each additional one and (by inspecting the code) seems to also delay nonlinearly according to the value of multiplicand "a"; this is due to the potential carry that may or may not occur, propagating "backwards" into bits that've already been shifted.

Your choice of the "best" routine will vary, of course, depending on how much of a premium you place on speed (versus memory consumption). The basic Right-Shift version is compact and succinct, whereas the fully-unrolled modified Right-Shift is plainly at the point of diminishing returns (burning memory to scrounge every possible cycle, I mean). It's not a criticism; in certain circumstances profligate memory usage is perfectly justifiable. The next step of course would be to adopt a Lookup Table approach: Garth explores this powerful concept here.

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Sat Aug 25, 2012 4:31 pm, edited 3 times in total.

Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 25, 2012 4:26 pm 
Offline

Joined: Sat Dec 13, 2003 3:37 pm
Posts: 1004
When you list memory usage, you're counting the size of the actual routine, not the working set of the problem, correct?


Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 25, 2012 4:54 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8540
Location: Southern California
Wow, thankyou Jeff!

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 25, 2012 7:03 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10977
Location: England
Just a thought: Booth's method allows for a little less work on average, by opting for subtraction instead of addition according to a cunning plan. The state of play could be encoded in the PC by having two branches to the code (no need to explicitly hold two bits of operand to make a decision)


Top
 Profile  
Reply with quote  
PostPosted: Sun Aug 26, 2012 6:02 pm 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Cheers, fellas. And thanks for the suggestion, Ed. I wasn't familiar with Booth's method, but eventually it began to make sense, and I scribbled out a flow chart in preparation to start coding. (Your "two branches to the code" idea works well here.) But on further reflection I'm not sure I understand your assertion that, "Booth's method allows for a little less work on average."

As I understand the matter (and sticking with the example of 16 bit operands), the worst case example for Booth's method will require 8 additions and 8 subtractions -- scarcely any faster or slower than 16 additions (as required by the worst case for the more conventional algorithms above). Likewise the speed in the best cases is the same. But, instead of the worst-case operand being FFFFh (or any number with lots of 1 bits), with Booth's method the worst case operand is something like 5555h or AAAAh -- where there are lots of 1-to-0 or 0-to-1 transitions between adjacent bits. So, how common are these transitions? By my calculations (I wrote a little Forth program :D ) if you look at each number from 0 to FFFFh and count the transitions seen by Booth in each, the total number is 557,056. This is slightly worse than 524,288 -- the number of 1 bits in those same 65,536 16-bit numbers. I suppose this might be a flaw in my notion of the algorithm -- and unless I code and successfully test it I won't know for sure... :roll:

BTW in the context of Forth and U* is it reasonable to optimize our coding based on the assumption that we have a random series of operands -- that all of the 64K possible values are equally probable? I'm skeptical of that assumption, but OTOH haven't come up with anything that's clearly superior. Really we're just guessing what sort of math the Forth user is coding, and what sort of numbers are likely to come up. :| Comments, anyone?

-- Jeff
ps- (for Will): when I mentioned memory usage I was refering to program space: the amount of storage occupied by the instructions.
pps- I wonder if Booth's method only shows a performance edge for numbers much longer than 16 bits...

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Sun Aug 26, 2012 6:29 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Sun Aug 26, 2012 6:22 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10977
Location: England
Hi Jeff
that doesn't sound so positive, but I think it must be a flawed analysis. Looking for example at some short strings in the middle of a word of zeros:
Code:
00000000 - no action
00000100 - single addition
00001000 - ditto
00001100 - one addition, one subtraction
00010000 - single add
00010100 - two additions
00011000 - one addition, one subtraction
00011100 - ditto (saving one op)
00100000 - single addition
00100100 - two additions
00101000 - ditto
00101100 - 3 ops: two additions and a subtraction
00110000 - one addition, one subtraction
00110100 - 3 ops: two additions and a subtraction
00111000 - one addition, one subtraction (saving one op)
00111100 - ditto (saving two ops)
Overall, we saved 4 operations (from an original 32.) Which isn't a huge proportion. But, how would your counting have analysed these?
Cheers
Ed


Top
 Profile  
Reply with quote  
PostPosted: Sun Aug 26, 2012 6:34 pm 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Code:
00000100 - single addition
I see this (and similar cases) as 2 transitions. It contains the string 10 and the string 01. So, one subtraction and one addition, respectively (if we're shifting to the right). Have I misinterpreted the method?

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Top
 Profile  
Reply with quote  
PostPosted: Sun Aug 26, 2012 6:39 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10977
Location: England
ok, that would explain the pessimistic result. It's clear that a single addition is enough: can it be arranged? Material explaining radix-4 Booth's tells us to look at overlapping sets of 3 bits. That's more obvious for hardware, especially where a shift is free (no carry chain) and the overall result is half the number of operations. In our case, shift is just as expensive as add, unless we were going to do it anyway.
Maybe this idea won't fly - a saving of one-eighth isn't much.


Top
 Profile  
Reply with quote  
PostPosted: Sun Aug 26, 2012 7:17 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10977
Location: England
Looking at Table 1 of http://www.geoffknagge.com/fyp/booth.shtml, if we take an 8-way branch on three bits of the operand, we can arrange for a single op (or no op) to cater for 2 bits.

If we can arrange to do everything from MSB first, that might be even easier
- the top bit is shifted out, and we BCC/BCS to the add or to the subtract branch of the code
- we shift once more
- a combination of BMI and BCS takes us to the four possibilities: an add then shift, a shift then add, or just a shift.

I am hand-waving rather than coding, but it's a saving of a quarter in additions. Would be more worthwhile for a 4-byte routine of course because the additions are more expensive. For 16-bit routines, we'd normally have an average of 8 additions, and we get that down to 6. Is that going to cover the overhead and make it worthwhile?

Cheers
Ed


Top
 Profile  
Reply with quote  
PostPosted: Sun Aug 26, 2012 7:32 pm 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
OK, you're getting ahead of me -- I'm not that quick. But definitely you are onto something. My approach was based on the Wikipedia article, under the section "A typical implementation." Their version only looks at two bits at a time. It works -- and is simple enough to make a good tutorial -- but, as you noted, it makes dumb decisions in certain cases. Three bits at a time would be better -- and even eight bits at a time may not be entirely ludicrous! It's all about anayzing tradeoffs... If I get time I'll have a go at coding a three-at-a-time version. Edit: Or at least predict the benefit, if any. Thanks for the info, Ed

J.

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Top
 Profile  
Reply with quote  
PostPosted: Wed Aug 29, 2012 5:13 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10977
Location: England
Not meaning to distract you, but I did some digging. I knew vaguely that there's a trick to long multiplication which uses fewer digit multiplications than you'd expect, and I wondered if a fast 4x4 or 8x8 multiplication could be used with this trick to do a fast 16x16. It turns out the trick is by Karatsuba (and was revolutionary): unfortunately it seems to demand one extra bit, so you end up needing 5x5 or 9x9 as a primitive: not so handy. Instead of doing that, and needing only 3 digit-sized multiplications but having to deal with 9x9, might be better to use 4 digit-sized multiplications in the obvious way.

How to get a fast 8x8? It turns out that there's a very neat Quarter-Square method. For the squares, you need a lookup table: seems to need 1k or 2k of table. (Would be much smaller if we proceed by nibbles and use a 4x4 primitive, but then 16x16 would take 16 of those and it's not obvious that we'd be winning.)

Inevitably, people have already tackled this, so here are some links:
26 cycles (??) for 8x8 A Different Perspective, part II by Taylor and Judd (search for "code to multiply")
66 cycles for 8x8 Fastest 6502 Multiplication Yet in AAL by Charles Putney using 1k of tables
90 cycles for 8x8 Fast 6502 multiplication by damian yerrick
8x8 and 16x16 Seriously fast multiplication (8/16 bit signed/unsigned multiplication) - by JackAsser/Instinct (uses 2k of tables)
More multiplication routines at codebase64
Sam Falvo's post covering 6502 and 65816
8x8 Fast multiplication by Fox, originally from Polish disk magazine "Syzygy" no. 6
Shift and add method explained by Jim Butterfield

Cheers
Ed


Top
 Profile  
Reply with quote  
PostPosted: Wed Aug 29, 2012 7:15 am 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
'kay, Ed -- I'll start checking all that out in the morning. I've taken a new tack with my own doodlings, and it'll be interesting to compare my efforts with what's out already there. As you say, many minds have already addressed the problem. I'll be happy if I can simply add a useful selection point on the speed-vs-memory tradeoff.

in your PREVIOUS post you wrote:
if we take an 8-way branch on three bits [...] For 16-bit routines, we'd normally have an average of 8 additions, and we get that down to 6. Is that going to cover the overhead and make it worthwhile?
Either there's a minor slip-up here or else I misunderstand your drift. Is it Radix 8 you were assuming? Although 6 out of 8 possible bit-patterns result in an operation (ie, an add or subtract), you mustn't suppose all patterns are equally probable. If we test the numbers 0 to FFFFh for pattern matching we see...
  • 557,056 operations result using Radix 4, where 2 out of 4 patterns operate
  • 425,984 operations result using Radix 8, where 6 out of 8 patterns operate
  • 335,872 operations result using Radix 16 where 14 out of 16 patterns operate
  • 278,528 operations result using Radix 32 where 30 out of 32 patterns operate
The explanation mostly has to do with the fact that Radix 32 (for example) shifts the number 4 bits at a time (compared to 3, 2, or 1 bit at a time), but there's more to it than that. BTW the Wikipedia article is flawed in that it scarcely even mentions radix -- you're out of luck unless you spot the external links.

For a while I was playing with a radix 32 scheme, but the cost in (zero-page) ram was gonna be pretty high, and that's much less acceptable than simply burning up a bunch of EPROM space.

Quote:
you need a lookup table: seems to need 1k or 2k of table
Lookups are definitely the secret weapon of choice when you need to nuke a speed problem! I've chosen a strategic target, though, and that is shifting -- which is the main bottleneck if you look at the graph I posted. (For the multiply-by-one cases, addition is pretty well out of the picture, and most of the remaining overhead is shifting.) The scheme I'm working on shifts 4 bits at a time, using two 256-byte Lookup Tables to keep things snappy. In the setup it retains the "times 1" multiplicand, of course, and it also prepares a times 2, 4 and 8 version. But that means most of the additions end up spanning 3 bytes instead of two, and it remains to be seen how much the accelerated shifting outweighs that.

cheers

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Top
 Profile  
Reply with quote  
PostPosted: Wed Aug 29, 2012 8:53 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10977
Location: England
I've been thinking about 4-bit shifts by lookup table, so it's good to hear about that!

I don't disbelieve your numbers, but I can't quite see why the patterns shouldn't be evenly distributed. (Is it because of the overlap??)

In that Table 1, it's a Radix 4 approach, but there are 8 patterns because of the need to include the leading bit. 6 of those 8 do some work, and as it happens each does the same amount of work, a single operation, for an average of 0.75 ops per two bits. The two bits we've dealt with would conventionally have cost us an average of one op.

So, in Radix 8 we should have 16 patterns, and again all but 2 will need to do some work, but in this case sometimes we need more than one operation. It seems(*) to take 9 ops per 3 bits consumed, compared to 12 for longhand multiplication. That is, no gain... hmm. (In hardware, it's only 7 ops because plus or minus 3 is arranged to done in a single cycle. In software plus or minus 3 takes two ops.)

At least one of us is missing something! (The material in https://www.google.co.uk/search?q=%22Ra ... gorithm%22 seems helpful.)

Cheers
Ed
(*) See for example page 2 of http://www.ijvspa.net/docs/IJVSPA20120215.pdf


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 35 posts ]  Go to page 1, 2, 3  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: