barrym95838 wrote:
That's my best "Rocky" emoji
Make some room on the podium, Mike -- we may have a tie for first place!
Based on random input values, it looks as if my modified right-shift algorithm can produce a full 64-bit result in roughly the same time as your routine takes to produce a 32-bit result. I realize Gordon only requires 32 bits, but it'll be interesting to compare the two routines just the same.
The code below is an 816-ified upscale of the modified-shift '02/'C02 approach I mentioned earlier. While preparing the '816 version I opted for a "plain-vanilla" approach because today I wanted to present something that was clear and easy to read. The original version is smaller because it reuses a single loop, but this new version is simpler and bulkier. Also there's no early-exit logic, and my comments don't have the laconic elegance that Mike's do!
Quote:
41 ticks for 0 * j, 79 or 104 ticks for i * 0, ~2332 ticks for 4294967295 * 4294967295, and typically something closer to 1000 ticks. But I don't have an '816 to test it.
Testing, yes... My
'02/'C02 version actually
has been tested for all 2^32 input combinations; this is something I edited my earlier post to mention. But the code below, while similar, hasn't been tested at all.
As for execution time, most approaches (including Mike's and mine) are strongly affected by the input values, and even the
order of the input values. I'll leave it to the judges to determine the statistical weighting of the input value stream that's deemed relevant for this contest (if it is a contest).
The routine below has a fixed overhead plus a variable penalty that's proportional to how many 1 bits are in the multiplicand passed in FACA. Assuming FACA and FACB are in direct page, I tally the best-case performance to be 812 ticks and worst-case to be [edit] 1548. Thus the average, corresponding to random input, is (drum roll, please!)
1180 cycles. This is virtually identical to
1200, which is about what I take Mike's average performance to be. (Correct me if I'm wrong, Mike. I admit I'm confused by the "79 or 104 ticks for i * 0" among your best-case figures.)
Random input aside, I suspect Mike's early-exit feature will give him an advantage for the Mandelbrot test. I'm unable to estimate the magnitude of that advantage because don't know what values Mandelbrot is likely to favor, but studying Mike's routine I see it exits a little early if one of the operands has, in its MS bits, a short string of consecutive zeroes... and it exits a LOT early if there's a long string of zeroes there! The other operand affects performance in a similar way but the zeroes need to be
right justified (ie, in the LS bits, not MS).
-- Jeff
Code:
; Computed Returns: FACA: the 8-byte value that becomes the 64 bit product (little-endian).
; The "bottom" of FACA also accepts one of the input values.
; Input values: FACA: the 4-byte value that's the 32 bit multiplicand (little-endian)
; FACB: the 4-byte value that's the 32 bit multiplier (little-endian)
imul rep #m_seta|#m_setx ;16-bit accumulator and XY 3
cld 2
;Clear the high 4 bytes of the 8-byte value at faca.
lda #0 ;NB: the top 2 bytes are in A, copied to faca+6 at exit. 3
sta faca+4 ;It helps to think of A as being synonymous with faca+6. 4
ldx #16 ;number of iterations, 1st loop 3
lsr faca+0 ;CY= 1st of 16 bits tested in the 1st loop. 7
;NB:faca+2 is not involved in the 1st loop.
LoopTop1:bcc No_Add1 2 | 3 (Lp)
tay ;Save A 2 | na (Lp)
clc ;Add facb into top 32 bits of faca 2 | na (Lp)
lda faca+4 4 | na (Lp)
adc facb 4 | na (Lp)
sta faca+4 ;low 16 bits done 4 | na (Lp)
tya ;Restore A = "faca+6" 2 | na (Lp)
adc facb+2 ;high 16 bits done 4 | na (Lp)
No_Add1: ; Rotate Carry -> A_aka_faca+6 -> faca+4 -> faca
ror A ;"faca+6" 2 (Lp)
ror faca+4 ; faca+4 7 (Lp)
;; faca+2 is not involved in the 1st loop.
ror faca ; faca+0 7 (Lp)
dex ; 2 (Lp)
bne LoopTop1 ; more iterations for this loop? 3 (Lp)
;
ldx #16 ;number of iterations, 2nd loop 3
lsr faca+2 ;CY= 1st of 16 bits tested in the 2nd loop. 7
;NB:faca+0 is not involved in the 2nd loop.
LoopTop2:bcc No_Add2 2 | 3 (Lp)
tay ;Save A 2 | na (Lp)
clc ;Add facb into top 32 bits of faca 2 | na (Lp)
lda faca+4 4 | na (Lp)
adc facb 4 | na (Lp)
sta faca+4 ;low 16 bits done 4 | na (Lp)
tya ;Restore A = "faca+6" 2 | na (Lp)
adc facb+2 ;high 16 bits done 4 | na (Lp)
No_Add2: ; Rotate Carry -> A_aka_faca+6 -> faca+4 -> faca+2
ror A ;"faca+6" 2 (Lp)
ror faca+4 ; faca+4 7 (Lp)
ror faca+2 ; faca+2 7 (Lp)
;; faca+0 is not involved in the 2nd loop.
dex ; 2 (Lp)
bne LoopTop2 ; more iterations for this loop? 3 (Lp)
sta faca+6 ;done! 4
Exit:
Edits: typo in cycle count stated in the text, booboo in one of the comments in the code