 Post subject: Re: Fast multiplication
PostPosted: Tue Oct 26, 2021 6:23 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
Dr Jefyll wrote:
Based on random input values, it looks as if my modified right-shift algorithm can produce a full 64-bit result in roughly the same time as your routine takes to produce a 32-bit result. I realize Gordon only requires 32 bits, but it'll be interesting to compare the two routines just the same.


Well, the quick answer is that it's not quite as fast as Mike's, but impressive if it's returning a 64-bit result.

Code:
My original
Mand:   14.104
Bench   8250,7954
Pi      6.489

Copro (Atmega @ 16Mhz)
Mand:   10.977
Bench   11805,11298
Pi      6.148

8x8
Mand:   10.943
Bench   12238,11532
Pi      6.101

BDD
Mand    12.964
Bench   9258,8895
Pi      6.313

MB
Mand    9.336
Bench   13735,13659
Pi      5.585

Jeff
Mand    10.844
Bench   12329,11519
Pi      6.051


I've re-run them all for this after tweaking some of my boilerplate code. In all but the co-pro routines, I check the signs of the numbers, force them to be positive if negative, and keep a flag for the sign of the result; then at exit, I negate the result if needed. If I said all I needed was a 31-bit unsigned multiply, would it make any difference?
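
(For anyone following along, this is roughly what that sign-handling wrapper amounts to, sketched in C - umul32() here is just a stand-in for whichever unsigned routine is under test, not anything from my actual code:)

Code:
#include <stdint.h>

/* Force both operands positive, remember whether the signs differed,
   multiply unsigned, then negate the result if needed. */
static uint32_t umul32(uint32_t a, uint32_t b)      /* stand-in for the routine under test */
{
    return a * b;
}

int32_t smul32(int32_t a, int32_t b)
{
    int negate = (a < 0) ^ (b < 0);                 /* sign of the result        */
    uint32_t ua = (a < 0) ? -(uint32_t)a : (uint32_t)a;
    uint32_t ub = (b < 0) ? -(uint32_t)b : (uint32_t)b;
    uint32_t p  = umul32(ua, ub);                   /* overflow simply discarded */

    return (int32_t)(negate ? 0u - p : p);          /* negate the result if needed */
}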

The two numbers after Bench are two runs of the test, one with I * N, the other with N * I. I is from 0 to 9999 and N is fixed at 12345. This may favour small numbers in one case (or the other), but it's interesting to see that all routines exhibit different timings based on the order of operands.

And bear in mind there is a lot more overhead here, in that this is bytecode interpreted by the '816, although I do a separate null-loop timing to help correct for the actual loop overhead. In a pure '816 assembly environment the multiplies-per-second figures would be higher.

The Mandelbrot inner loop looks like:

Code:
  WHILE cy <= ymax DO
  {
    cx := xmin
    WHILE cx <= xmax DO
    {
      LET q=?
      LET x,y,x2,y2,iter = 0,0,0,0,0

      WHILE iter < maxiter DO
      {
        IF x2+y2 > 16384 THEN
          BREAK

        q := x*y
        TEST q < 0 THEN
          y := (-(-q >> 11)) + cy
        ELSE
          y := (   q >> 11)  + cy

        x    := x2 - y2 + cx
        x2   := (x*x) >> 12
        y2   := (y*y) >> 12

        iter := iter + 1
      }
      sawrch (' ' + iter)
      cx := cx + dx
    }
    sawrch ('*n')

    cy := cy + dy
  }


The q calculations here are to cope with signed divides: I use a shift rather than a divide operation, since it's all scaled integers, but there are also negative numbers, which add overhead cycles to the code...
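
(A toy C illustration of what those extra negations buy - nothing from the actual Cintcode, just the arithmetic:)

Code:
#include <stdio.h>

/* An arithmetic right shift of a negative scaled value rounds toward minus
   infinity, whereas the scaled divide wanted here should truncate toward
   zero - hence negate, shift the positive value, negate back. */
int main(void)
{
    int q = -41000;                     /* some negative scaled value          */

    printf("%d\n", q / 2048);           /* -20 : what a divide by 2^11 gives   */
    printf("%d\n", -((-q) >> 11));      /* -20 : negate, shift, negate back    */
    printf("%d\n", q >> 11);            /* -21 : plain shift rounds down (and  */
                                        /*       is implementation-defined for */
                                        /*       negative values in C anyway)  */
    return 0;
}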

The Pi calculation code possibly does more divide and MOD operations than multiplies, but this function is called a lot:

Code:
AND divide(src, dst, d, l) BE {
  LET rem, n = 0, ?

  FOR i = 0 TO l DO
  { n := src%i + 100 * rem
    rem := n MOD d
    dst%i := n / d
  }
}


which is a scaled integer divide over an arbitrarily long string of digits. (src%i means get the byte from address src+i.)
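
(The same function rendered in C, for the curious - the 100 * rem suggests the digits are held base-100, one per byte, most significant first, and dst may well be the same array as src:)

Code:
#include <stdint.h>

static void divide(const uint8_t *src, uint8_t *dst, int d, int l)
{
    int rem = 0;

    for (int i = 0; i <= l; i++) {      /* BCPL's FOR includes the limit      */
        int n = src[i] + 100 * rem;     /* bring down the next base-100 digit */
        rem = n % d;
        dst[i] = (uint8_t)(n / d);
    }
}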

If anyone is interested in the BCPL code and cintcode, then this is the BCPL code of the 2nd loop in my multiply benchmark:

Code:
    FOR i = startVal TO endVal DO
      res := i * multiplier


and this is what it compiles into

Code:
 303: L16:
 303:     LL  L4        -- Load constant at L4 (12345) into regA
 305:   LP10            -- regB := regA ; regA := Local variable 10 (stack)
 306:    MUL            -- Multiply regA := regA * regB - Our code!
 307:    SP3            -- Store regA at local variable 3 (stack)
 308:     L1            -- regA := 1
 309:   AP10            -- add local variable 10 into regA
 310:   SP10            -- store at local variable 10
 311:   LP11            -- regB := regA ; load local variable 11 into regA
 312:    JLE  L16       -- jump back if <=


I think it's relatively efficient code for a FOR loop - it's just down to the bytecode interpreter to run this as fast as possible. (And incidentally, all these opcodes are one byte long, with the exception of LL and JLE, which are two bytes - the second byte is a signed offset to the data/destination - so that loop is 11 bytes.)
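
(Not real Cintcode, but for anyone who hasn't seen a bytecode VM before, the dispatch loop has roughly this shape - the regA/regB model and the opcode meanings follow the comments in the disassembly above, the encodings themselves are invented, and where the real VM folds small operands into the opcode byte this sketch just reads them as a following byte:)

Code:
#include <stdint.h>

enum { OP_LL = 1, OP_LP, OP_SP, OP_AP, OP_L, OP_MUL, OP_JLE, OP_HALT };

void run(const int8_t *code, int32_t *frame, const int32_t *pool)
{
    int32_t regA = 0, regB = 0;
    int pc = 0;

    for (;;) {
        switch (code[pc++]) {
        case OP_LL:  regA = pool[(uint8_t)code[pc++]];               break; /* load constant       */
        case OP_LP:  regB = regA; regA = frame[(uint8_t)code[pc++]]; break; /* load local          */
        case OP_SP:  frame[(uint8_t)code[pc++]] = regA;              break; /* store local         */
        case OP_AP:  regA += frame[(uint8_t)code[pc++]];             break; /* add local to regA   */
        case OP_L:   regA = code[pc++];                              break; /* load small constant */
        case OP_MUL: regA = (int32_t)((uint32_t)regA * (uint32_t)regB); break; /* our MUL          */
        case OP_JLE: { int8_t off = code[pc++];                             /* loop back if        */
                       if (regB <= regA) pc += off; }                break; /*   regB <= regA      */
        default:     return;                                                /* OP_HALT / unknown   */
        }
    }
}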

Maybe enough for now unless someone else wants to stand up to the podium? :-)

Cheers,

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


 Post subject: Re: Fast multiplication
PostPosted: Tue Oct 26, 2021 10:29 pm 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1950
Location: Sacramento, CA, USA
drogon wrote:
... it's interesting to see that all routines exhibit different timings based on the order of operands.
For an extreme example, 2147483648 * 1 is about 2000% slower than 1 * 2147483648 for my subroutine. Luckily, the latter is pretty darned fast.
Quote:
Maybe enough for now unless someone else wants to stand up to the podium? :-)
Are we starting to wear on your nerves?

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


 Post subject: Re: Fast multiplication
PostPosted: Thu Oct 28, 2021 9:51 am 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
barrym95838 wrote:
Are we starting to wear on your nerves?


Haha, no, but let's wrap this up (again!)...

I used a software routine to generate the 1KB table of quarter squares for the original 8x8 multiply, then used that 8x8 multiply to generate the full 128KB table. I did this at boot time - it takes just over a second @ 16MHz...

Then, in my Cintcode VM, a modified version of my existing 8x8 multiply code pulls the product straight out of this table.

Faster - I win ;-)

But at the expense of a second or so boot time and 128KB of RAM...
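
(Roughly what that boot-time build does, sketched in C - the two 64KB banks of high and low bytes, indexed by the two operand bytes concatenated, match what the ccMUL code below looks up at $60000,x and $70000,x:)

Code:
#include <stdint.h>

static uint16_t qsq[511];          /* floor(n*n/4) for n = 0..510, ~1KB       */
static uint8_t  mul_hi[65536];     /* high byte of a*b, index = (a << 8) | b  */
static uint8_t  mul_lo[65536];     /* low  byte of a*b                        */

static void build_tables(void)
{
    for (int n = 0; n < 511; n++)
        qsq[n] = (uint16_t)(n * n / 4);

    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            int hi = (a > b) ? a : b, lo = (a > b) ? b : a;
            uint16_t p = qsq[hi + lo] - qsq[hi - lo];   /* a*b = f(a+b) - f(a-b) */

            mul_hi[(a << 8) | b] = (uint8_t)(p >> 8);
            mul_lo[(a << 8) | b] = (uint8_t)p;
        }
    }
}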

Code:
8x8ultra:
Mand    8.089
Bench   21503,19377
Pi      5.732


Compared to my original code:

Code:
Original:
Mand:   14.104
Bench   8250,7954
Pi      6.489


Mandelbrot is nearly twice as fast and the 'raw' multiply test is about 2.5 times faster.

But, again, at the expense of 128KB of RAM. My Ruby system has 512KB of RAM with the potential to upgrade easily to 1MB, so I might do that in the near future, but I think, for now, I'll save that RAM for other things and fall back to MikeB's most excellent, fast and compact code. (Although I can still compile and edit small programs with that 128KB "lost".)

Here is the full code for anyone interested.

Code:
;********************************************************************************
; MUL
;       A := A * B
;********************************************************************************

.proc   ccMUL
        .a16
        .i16

; Copy B to Q to preserve B

        copyReg         regB,regQ
        checkSigns      regA,regQ

; Zero the result in 64-bit regW

        stz     regW+0
        stz     regW+2
;       sta     regW+4          ; Not needed
;       sta     regW+6

; 32-32 multiply of
;               P  Q  R  S
;             x T  U  V  W

digS    =       regA+0
digR    =       regA+1
digQ    =       regA+2
digP    =       regA+3

digW    =       regQ+0
digV    =       regQ+1
digU    =       regQ+2
digT    =       regQ+3

.macro  doMul8  d1,d2,res
        a8
        lda     d1
        swa
        tay                             ; Temp. save for the next rounds
        lda     d2
        tax                             ; X := (d1 << 8) | d2 - 16-bit index into the product tables

        lda     $60000,x                ; high
        swa
        lda     $70000,x                ; Low   - 16 bit result now in A

        a16
        clc
        adc     res
        sta     res
.endmacro

.macro  doMul8b d2,res
        tya                             ; restore d1 << 8
        a8
        lda     d2
        tax                             ; X := (d1 << 8) | d2 again - B still holds d1

        lda     $60000,x                ; high
        swa
        lda     $70000,x                ; Low   - 16 bit result now in A

        a16
        clc
        adc     res
        sta     res
.endmacro

; First round: W * PQRS

        doMul8  digW,digS,regW+0
        doMul8b      digR,regW+1
        doMul8b      digQ,regW+2
        doMul8b      digP,regW+3

; Second round: V * SRQ

        doMul8  digV,digS,regW+1
        doMul8b      digR,regW+2
        doMul8b      digQ,regW+3

; Third round: U * SR

        doMul8  digU,digS,regW+2
        doMul8b      digR,regW+3

; fourth round: T * S

        doMul8  digT,digS,regW+3

; result now in regW
; .... copy bottom 32-bits to regA, discarding overflow

        copyReg regW,regA
        fixSignA
        nextOpcode
.endproc


Cheers,

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


 Post subject: Re: Fast multiplication
PostPosted: Thu Oct 28, 2021 10:18 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
With today's large fast SRAMs, I think large tables and bigger firmware could unlock some performance which wasn't possible back in the day. I think it's quite fruitful!

Having said which, having fast code without huge tables is a win, always.


 Post subject: Re: Fast multiplication
PostPosted: Thu Oct 28, 2021 11:07 am 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
BigEd wrote:
With today's large fast SRAMs, I think large tables and bigger firmware could unlock some performance which wasn't possible back in the day. I think it's quite fruitful!

Having said which, having fast code without huge tables is a win, always.


It's ... a very interesting area with some non-obvious solutions too. In my own experience, using a table to look something up is a "classic" technique; however, as CPU speeds increased beyond what RAM could deliver and on-chip caches became a thing, table lookups sometimes became slower due to the relative slowness of RAM and the time taken to fill a cache line... In one application (converting aLaw audio to uLaw) it was faster to compute a result than look it up in a table as the compute code and data fitted entirely inside the CPU cache at that time.

Systems in the early 80's started to use hardware to speed things up - the TI99 could have been an early contender here, also the BLIT "smart" terminal, but the Amiga with its "blitter" to help speed up graphics was an early contender for what we might now call a GPU... But even then, early 80's RAM was still expensive.

What would help here? An FPGA can do a 32-bit single-cycle multiply and divide - e.g. the Foenix 816 board has one as standard - and that might be a viable solution for anyone wanting to use the 6502 or '816 in any sort of application where you need faster calculations, but realistically where would you draw the line? Something achievable by the mid 80's (the Apple IIgs could be expanded to 8MB of RAM), but by then the 68000 and x86 CPUs were in daily use and the ARM was creeping over the horizon...

It's also worthwhile to look back at some of the games of the early 80's too - notably Elite on the BBC Micro (and ported to others), Choplifter on the Apple II, and many more. They achieved all that with pure software solutions in some very tight memory scenarios.

Where does that leave our retro 6502 and 65816 systems? Personally, I think trying to up-cycle them into something more modern is a nice exercise, and I could write a nicer editor and text formatter or even a WYSIWYG word processor (remembering that WordStar was "the thing" in 1978 and VisiCalc in 1979, running on computers with barely 64KB of RAM), and do some old-style scientific and engineering calculations (with single-precision floating point), so a few light nods to that era are good enough for me for now.

I've actually gotten more interested in my retro software environment (i.e. my multi-tasking BCPL system) than the hardware, so who knows where that will take me next... (bearing in mind that BCPL was the thing in 1966 and Unix V6 (the first I used) in 1975...)

Anyway, drifting off, so I'll leave it at that.

Cheers,

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


 Post subject: Re: Fast multiplication
PostPosted: Thu Oct 28, 2021 11:57 am 

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 704
Location: North Tejas
drogon wrote:
Systems in the early 80's started to use hardware to speed things up - the TI99 could have been an early contender here,


Other than the TMS9918A video controller, the hardware design was extremely self-penalizing.

drogon wrote:
What would help here? An FPGA can do a 32-bit single-cycle multiply and divide - e.g. the Foenix 816 board has one as standard - and that might be a viable solution for anyone wanting to use the 6502 or '816 in any sort of application where you need faster calculations, but realistically where would you draw the line? Something achievable by the mid 80's (the Apple IIgs could be expanded to 8MB of RAM), but by then the 68000 and x86 CPUs were in daily use and the ARM was creeping over the horizon...


Someone can design a well-documented, processor-independent math coprocessor which can be added to almost any system using either an FPGA or a fast microcontroller.

The CPU board of my Smoke Signal Broadcasting system had a socket for an AM9511 or AM9512 math chip.

Southwest Technical Products sold an add-on board containing a National Semiconductor calculator chip.

drogon wrote:
It's also worthwhile to look back at some of the games of the early 80's too - notably Elite on the BBC Micro (and ported to others), Choplifter on the Apple II, and many more. They achieved all that with pure software solutions in some very tight memory scenarios.


It is quite amazing how they were able to make these games available on so many different platforms. The code cannot be very portable...

drogon wrote:
Where does that leave our retro 6502 and 65816 systems? Personally, I think trying to up-cycle them into something more modern is a nice exercise, and I could write a nicer editor and text formatter or even a WYSIWYG word processor (remembering that WordStar was "the thing" in 1978 and VisiCalc in 1979, running on computers with barely 64KB of RAM), and do some old-style scientific and engineering calculations (with single-precision floating point), so a few light nods to that era are good enough for me for now.


WordStar ran quite well with substantially less than 64K of RAM. It spilled the document onto disk and that was the size limit.

drogon wrote:
I've actually gotten more interested in my retro software environment (i.e. my multi-tasking BCPL system) than the hardware, so who knows where that will take me next... (bearing in mind that BCPL was the thing in 1966 and Unix V6 (the first I used) in 1975...)


That is why I get my joy implementing software for "classic" processors.


 Post subject: Re: Fast multiplication
PostPosted: Thu Oct 28, 2021 2:30 pm 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
drogon wrote:
In one application (converting aLaw audio to uLaw) it was faster to compute a result than look it up in a table as the compute code and data fitted entirely inside the CPU cache at that time.
Yes, it's interesting how the potent solution of one era becomes passé with the advent of a new class of hardware. I think I'd enjoy playing with a 65xx processor that featured on-chip cache... and, who knows, maybe that'll become a reality someday, thanks to soft cores running on FPGA.

As for present-day options, I'm entirely receptive to the idea of using large RAMs as performance enhancers. For example, I'd be inclined to consider extra RAM as an alternative to, say, interfacing a math co-processor to the 65xx. Large RAMs are quite affordable these days, and furthermore they're a more general solution than a chip that's only good at doing math.

Of course I realize not all math problems can reasonably be solved by fetching data from a lookup table. But a lookup table can also contain an array of code snippets, and this is something that's been intriguing me lately.

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


 Post subject: Re: Fast multiplication
PostPosted: Thu Oct 28, 2021 4:55 pm 

Joined: Wed Jun 23, 2021 8:02 am
Posts: 166
Here's a bit of a tweak of Mike's code. There are two main changes:

1. The top word of the first operand is handled separately - you only need to multiply it by the bottom word of the second operand and produce a 16-bit result, which is considerably simpler.
2. The test for no remaining 1 bits in the multiplier is combined with the shift.

It's not tested at all - I don't have an 816. But if it works I think it chops around 900 cycles off the worst case.

Code:
;-----------------------------------------------------;
; 32-bit unsigned multiply for the '802/816: i *= j.  ;
; entry:   16-bit mem/acc, 32-bit i, j and k in DP    ;
; exit:    overflow is ignored/discarded, j and k     ;
;            are modified, a = 0                      ;
mul:
    lda i+2         ; 4 ;   high word of multiplier
    beq mul_low     ; 2(3) ; skip next section if it's zero
    sta k+2         ; 4 ;   save it in temporary

; high word of multiplier not zero so deal with that
; multiply high word of i by low word of j, low word of result -> i+2
    lda j           ; 4 ;   low word of j
    sta k           ; 4 ;   save in temporary storage
    lda #0          ; 3 ;   zero partial product
    bra mul_high_2  ; 3 ;   branch to check multiplier bit
    ; up to here 24 cycles, 7 if top word of i was zero

; loop 20 or 25, 0-15 iterations
; last section 12 or 17 cycles
; best case 17, worst case 392
mul_high_1:
    bcc mul_high_3  ; 2(3) ; skip if bit shifted out was 0
    clc             ; 2 ;
    adc k           ; 4 ; if bit was 1, add multiplicand to partial product
mul_high_3:
    asl k           ; 7 ; shift multiplicand left to account for place value of next bit
mul_high_2:
    lsr k+2         ; 7 ;   shift next multiplier bit into C (bit 16...31)
    bne mul_high_1  ; 2(3) ; do another loop if more 1's remaining
    bcc mul_high_4  ; 2(3) ; skip if bit was 0
    clc             ; 2 ;
    adc k           ; 4 ; if bit was 1, add multiplicand to partial product
mul_high_4:

; now deal with low word of i
mul_low:
    sta i+2         ; 4 ; either top word from previous section or zero
    lda i           ; 4 ; low word of multiplier
    stz i           ; 4 ; low word of result zero for now
    sta k           ; 4 ; save
    beq mul_low4    ; 2(3) ; if low word of multiplier zero, finish now
    bra mul_low2    ; 3 ;
    ; 19 cycles + time to mul_low

; loop 27 or 52, 0-15 iterations
; last section 12 or 37 cycles
; best case 12, worst case 817
mul_low1:
    bcc mul_low3    ; 2(3) ; skip if bit shifted out was 0
    clc             ; 2 ; 32 bit add multiplicand to partial product
    lda i           ; 4 ;
    adc j           ; 4 ;
    sta i           ; 4 ;
    lda i+2         ; 4 ;
    adc j+2         ; 4 ;
    sta i+2         ; 4 ;
mul_low3:
    asl j           ; 7 ; 32 bit shift multiplicand left to account for place value of next bit
    rol j+2         ; 7 ;
mul_low2:
    lsr k           ; 7 ; shift multiplier right into C
    bne mul_low1    ; 2(3) ; do another loop if more 1's remaining
    bcc mul_low4    ; 2(3) ; skip if bit was 0
    clc             ; 2 ; 32 bit add multiplicand to partial product
    lda i           ; 4 ;
    adc j           ; 4 ;
    sta i           ; 4 ;
    lda i+2         ; 4 ;
    adc j+2         ; 4 ;
    sta i+2         ; 4 ;
mul_low4:
    rts             ; 6 ;

; WCET 24+25*15+17+21+52*15+37+6 = 1260 cycles


 Post subject: Re: Fast multiplication
PostPosted: Thu Oct 28, 2021 5:54 pm 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1950
Location: Sacramento, CA, USA
Thank you kernelthread! If your mods work as claimed, the extra code is certainly justified. I edited my code comments to place it in the public domain.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


 Post subject: Re: Fast multiplication
PostPosted: Thu Oct 28, 2021 6:06 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Thanks for the license Mike!


 Post subject: Re: Fast multiplication
PostPosted: Sun Oct 31, 2021 9:07 pm 

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
A few ways to do math are discussed here: http://map.grauw.nl/sources/external/z80bits.html

For multiplication, the 4xy = (x + y)^2 - (x - y)^2 method is my preferred one.
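
(For example, with a table of f(n) = n^2/4 rounded down: 13 * 34 = f(47) - f(21) = 552 - 110 = 442. The truncation errors always cancel because x+y and x-y have the same parity.)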

Even if it's not 6502 code, the discussions are still valid. And you get to write the code yourself as an added bonus :wink: .

(6502 code for above can be found here: http://www.6502.org/source/integers/fastmult.htm )


 Post subject: Re: Fast multiplication
PostPosted: Sun Oct 31, 2021 10:07 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
kakemoms wrote:
A few ways to do math are discussed here: http://map.grauw.nl/sources/external/z80bits.html

For multiplication, the 4xy = (x + y)^2 - (x - y)^2 method is my preferred one.

Even if it's not 6502 code, the discussions are still valid. And you get to write the code yourself as an added bonus :wink: .

(6502 code for above can be found here: http://www.6502.org/source/integers/fastmult.htm )


But if you follow the thread here, it seems that an algorithmic approach is faster - or seems so when extending the basic 8x8 up to 32x32...

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


 Post subject: Re: Fast multiplication
PostPosted: Thu Nov 11, 2021 6:09 pm 

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
I've been thinking about special cases of multiplication recently. This includes multiplication of a large number of values by one value which is unknown before execution, squaring, caching multiplication algorithms and caching multiplication in multi-threaded applications or across multiple interrupt levels.

My primary interest is volume control for a Rational DAC. In this case, all linear PCM samples should be multiplied by a value which changes infrequently. I hoped to reduce overhead by writing out the subset of long multiplication required for any given volume setting, but without the conditional branching. To further constrain the problem, I have the luxury of choosing a subset of volume settings such that long multiplication consumes a bounded and relatively constant amount of processing time. This prevents weird things happening at specific volume levels. However, after seeing benchmarks for 4xy = (x + y)^2 - (x - y)^2, it is quite obvious that long multiplication without branching would be inefficient. I mention the technique only because I explored it and found it fruitless.

For a Mandelbrot benchmark, it might be worthwhile to have a dedicated square operation. In this case, the leading diagonal of the multiplication matrix can be obtained directly from tables with no further effort. The remainder can be computed as 2xy because the upper and lower echelons of the multiplication matrix are symmetrical.
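
A sketch of that square operation for a 16-bit input n = 256*a + b, where n*n = 65536*a*a + 512*a*b + b*b: the two diagonal terms come straight from a 256-entry table of squares, and only the cross term needs a genuine 8*8 multiply (mul8() below is merely a stand-in for whatever routine or cache supplies it).

Code:
#include <stdint.h>

static uint16_t sq8[256];                              /* sq8[x] = x*x             */

static uint16_t mul8(uint8_t x, uint8_t y)             /* stand-in 8*8 multiply    */
{
    return (uint16_t)((uint16_t)x * y);
}

static uint32_t square16(uint16_t n)
{
    uint8_t a = (uint8_t)(n >> 8), b = (uint8_t)n;

    return ((uint32_t)sq8[a] << 16)                    /* a*a from the diagonal    */
         + ((uint32_t)mul8(a, b) << 9)                 /* the symmetric 2*a*b term */
         + sq8[b];                                     /* b*b from the diagonal    */
}

static void init_sq8(void)
{
    for (int x = 0; x < 256; x++)
        sq8[x] = (uint16_t)(x * x);
}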

With or without a dedicated square operation, it may be desirable to eliminate a 128KB table and instead maintain a 256 entry cache of 8*8 bit multiplication results. If x > y then square operations automatically obtain 100% cache hits on one echelon. Likewise, caching has disproportionate benefit when both inputs for the multiplication are derived from nested loops. Obviously, this is slower than a 128KB table but I believe that the cache only requires three or four pages.

Initially, I thought that it would be worthwhile to have a hash key of ROL x XOR y. Then, atomically, within one cache line, the 8-bit x and 8-bit y are checked in full before collecting a 16-bit result. If the hash algorithm is commutative then the full check may occur in either order. This requires four pages, excluding pages to obtain 2xy or 4xy. However, if squaring isn't important then ordering is less important. Therefore, x itself may be the look-up index and only y is checked before collecting a 16-bit result. This only requires three pages. There is the complication that it may or may not be possible to share some of the cache algorithm with some of the 8*8 bit multiply algorithm. Even if it is possible to share, it may not be desirable.
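
As a sketch of that simpler variant (x as the index, y checked as a tag, three 256-byte pages in all; mul8_slow() stands in for whichever real 8*8 routine runs on a miss):

Code:
#include <stdint.h>

static uint8_t tag_y[256];       /* y value each entry was computed for */
static uint8_t prod_lo[256];     /* low byte of x * tag_y[x]            */
static uint8_t prod_hi[256];     /* high byte of x * tag_y[x]           */

static uint16_t mul8_slow(uint8_t x, uint8_t y)        /* stand-in for the real one */
{
    return (uint16_t)((uint16_t)x * y);
}

/* Start every entry as x * 0 = 0 so no separate "valid" page is needed. */
static void cache_init(void)
{
    for (int x = 0; x < 256; x++) {
        tag_y[x] = 0;
        prod_lo[x] = 0;
        prod_hi[x] = 0;
    }
}

static uint16_t mul8_cached(uint8_t x, uint8_t y)
{
    if (tag_y[x] == y)                                 /* hit: x indexes, y confirms */
        return (uint16_t)((prod_hi[x] << 8) | prod_lo[x]);

    uint16_t p = mul8_slow(x, y);                      /* miss: compute and replace  */
    tag_y[x]   = y;
    prod_lo[x] = (uint8_t)p;
    prod_hi[x] = (uint8_t)(p >> 8);
    return p;
}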

Obtaining (x - y)^2 via look-up table either requires sign adjustment after subtraction, ensuring x > y before subtraction, or a double-ended look-up table (which is far too cumbersome on the 6502). Well, x > y may be most convenient. However, ordering of x and y may or may not be guaranteed by the cache. Given the choices for hash key algorithm, it is not obvious to me whether x and y should be swapped before hash key computation, during cache look-up, after cache miss, or at all. Furthermore, some hash key algorithms skew replacement while others allow duplicate caching of x*y and y*x. It is not obvious to me which hash key algorithm saves the most clock cycles.

I suspect that the best hash key algorithm depends upon the input data. In the case of a Mandelbrot benchmark, squaring and nested loops may not present sufficiently general input. Even in this relatively trivial case, high iterations of Mandelbrot calculation may create cache hot-spots or completely churn the cache. Any pixel which requires more than 256 distinct 8*8 bit multiplications may perform worse than with no caching at all. Worse, this situation will be averted or forced by the Mandelbrot iteration limit. Caching is also moderately sensitive to loop increment. An increment of 64 units ensures that the bottom byte cycles around four values while the next most significant byte increments by one and more significant bytes are relatively unchanged. Whereas an increment of 63 units will cycle through all possible values.

There is the further problem that multiplication with cache interacts badly with interrupts. Indeed, I hoped to use fast multiplication inside interrupt to play sound samples at any desired volume. Interrupt duration and interrupt jitter can be minimized by ensuring that cache reads and writes are atomic. And the laziest way of implementing this is one cache for every interrupt level. There's no good solution here. However, similar problems occur with a hardware memory mapped multiplier and options to avert them may be more limited.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!


 Post subject: Re: Fast multiplication
PostPosted: Wed Mar 29, 2023 9:24 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Just parking a forward reference to a subsequent thread:
Comparing 6502 multiply routines

