Dr Jefyll wrote:
Based on random input values, it looks as if my modified right-shift algorithm can produce a full 64-bit result in roughly the same time as your routine takes to produce a 32-bit result. I realize Gordon only requires 32 bits, but it'll be interesting to compare the two routines just the same.
Well, the quick answer is that it's not quite as fast as Mike's, but impressive if it's returning a 64-bit result.
Code:
                         Mand     Bench           Pi
My original              14.104    8250,  7954    6.489
Copro (ATmega @ 16MHz)   10.977   11805, 11298    6.148
8x8                      10.943   12238, 11532    6.101
BDD                      12.964    9258,  8895    6.313
MB                        9.336   13735, 13659    5.585
Jeff                     10.844   12329, 11519    6.051
I've re-run them all for this after tweaking some of my boilerplate code. In all but the co-pro routines, I check the signs of the numbers, force them to be positive if negative, and keep a flag for the sign of the answer; at exit I negate the result if needed. If I said all I needed was a 31-bit unsigned multiply, would it make any difference?
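In BCPL terms that wrapper is roughly the following (a sketch only - umul is just a stand-in name for whichever unsigned core routine is plugged in underneath):
Code:
// Sketch only: umul is a stand-in for the unsigned multiply core.
LET mulsigned(a, b) = VALOF
{ LET neg = (a XOR b) < 0        // result is negative iff the signs differ
  LET r   = umul(ABS a, ABS b)   // do the work on positive operands
  RESULTIS neg -> -r, r          // restore the sign on the way out
}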
The two numbers after Bench are two runs of the test: one with I * N, the other with N * I. I runs from 0 to 9999 and N is fixed at 12345. This may favour small numbers in one case (or the other), but it's interesting to see that all the routines exhibit different timings depending on the order of the operands.
And bear in mind there is a lot more overhead here, in that this is bytecode interpreted by the '816, although I do a separate null-loop timing to help correct for the actual loop overhead. In a pure '816 ASM environment the multiplies per second will be faster.
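That correction is just the usual subtract-the-empty-loop trick, something along these lines (a sketch, not the actual harness - readtimer is a stand-in for whatever timer call is really used; startVal, endVal, multiplier and res are as in the benchmark loop shown further down):
Code:
LET t0, nullTime, mulTime = ?, ?, ?

t0 := readtimer()
FOR i = startVal TO endVal DO LOOP            // empty body: control overhead only
nullTime := readtimer() - t0

t0 := readtimer()
FOR i = startVal TO endVal DO res := i * multiplier
mulTime := readtimer() - t0 - nullTime        // cost of the multiplies alone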
The Mandelbrot inner loop looks like:
Code:
WHILE cy <= ymax DO
{
  cx := xmin
  WHILE cx <= xmax DO
  {
    LET q = ?
    LET x, y, x2, y2, iter = 0, 0, 0, 0, 0
    WHILE iter < maxiter DO
    {
      IF x2+y2 > 16384 THEN
        BREAK
      q := x*y
      TEST q < 0 THEN
        y := (-(-q >> 11)) + cy
      ELSE
        y := ( q >> 11) + cy
      x := x2 - y2 + cx
      x2 := (x*x) >> 12
      y2 := (y*y) >> 12
      iter := iter + 1
    }
    sawrch (' ' + iter)
    cx := cx + dx
  }
  sawrch ('*n')
  cy := cy + dy
}
The q calculations here are to cope with signed divides; I use a shift rather than a divide operation since it's all scaled integers, but there are also negative numbers, which add overhead cycles to the code...
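To illustrate why the negative case needs the extra negations: BCPL's >> is a logical (zero-fill) shift, so shifting a negative value right doesn't behave like a signed divide - you shift the magnitude instead and put the sign back afterwards. A quick sketch (assuming a 32-bit word):
Code:
GET "libhdr"

LET start() = VALOF
{ LET q = -8192
  writef("plain shift:  %n*n",   q >> 11 )   // zero-fill shift of 0xFFFFE000 -> 2097148, nonsense
  writef("negate trick: %n*n", -(-q >> 11))  // shift the magnitude, negate -> -4, as intended
  RESULTIS 0
}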
The Pi calculation code possibly does more divide and MOD operations than multiplies, but this function is called a lot:
Code:
AND divide(src, dst, d, l) BE {
  LET rem, n = 0, ?
  FOR i = 0 TO l DO            // long division, most significant byte first
  { n := src%i + 100 * rem     // next byte of src plus the carried remainder * 100
    rem := n MOD d             // remainder carries into the next position
    dst%i := n / d             // quotient byte goes into dst
  }
}
which is a scaled integer divide over an arbitrarily long string of digits (src%i means "get the byte at address src+i").
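A typical call just divides the whole digit string in place by a small constant, something like this (hypothetical names and values, not lifted from the actual Pi program):
Code:
// Hypothetical call: divide the len+1 bytes at 'digits' by 7, in place.
divide(digits, digits, 7, len)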
If anyone is interested in the BCPL code and cintcode, then this is the BCPL code of the 2nd loop in my multiply benchmark:
Code:
FOR i = startVal TO endVal DO
res := i * multiplier
and this is what it compiles into
Code:
303: L16:
303: LL L4 -- Load constant at L4 (12345) into regA
305: LP10 -- regB := regA ; regA := Local variable 10 (stack)
306: MUL -- Multiply regA := regA * regB - Our code!
307: SP3 -- Store regA at local variable 3 (stack)
308: L1 -- regA := 1
309: AP10 -- add local variable 10 into regA
310: SP10 -- store at local variable 10
311: LP11 -- regB := regA ; load local variable 11 into regA
312: JLE L16 -- jump back if <=
I think that's relatively efficient code for a FOR loop; it's just down to the bytecode interpreter to run it as fast as possible. (Incidentally, all these opcodes are one byte long with the exception of LL and JLE, which are two bytes each; the second byte is a signed offset to the data/destination, so that loop is 11 bytes.)
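For anyone curious how those two-byte ops decode, something along these lines would recover the target (a sketch only, not the real interpreter - I'm assuming the signed offset is taken relative to the offset byte itself, and the actual base cintcode uses may differ):
Code:
LET sext8(b) = b > 127 -> b - 256, b                    // sign-extend an 8-bit byte
AND jletarget(code, pc) = pc + 1 + sext8(code%(pc+1))   // pc points at the JLE opcode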
Maybe that's enough for now, unless someone else wants to step up to the podium?
Cheers,
-Gordon