teamtempest wrote:
That is a lot of very interesting information. One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.
Just to be clear, the 'average' figures I give are the arithmetic mean: the sum of all cycle counts divided by the number of counts. In the graphs, the error bars show this average as the dot between the minimum and maximum cycle counts. Most routines have low variance between the min and max values (bars of small height); those that don't usually have special case code for handling zero.
teamtempest wrote:
All the 'shift-and-add' routines I looked at used an index register to count loops. In that case, it seems to me a fairly simple optimization for the 16x16 multipliers would be to check whether the high byte of the multiplier is zero. If so, nothing in it can affect the result, and the loop count could be reduced from 16 to 8. It costs only a few bytes to check, and any multiplication by a value less than 256 would benefit. To go further, you could also check the multiplicand and swap if the multiplier was 256 or greater but the multiplicand was less (more code, of course). Further still, you could write separate 8x8, 8x16 and full 16x16 routines and branch to one of them based on examining the multiplicand and multiplier (more code, of course).
Instead of counting, another way to control the loop is to check whether the multiplier has become zero after each iteration. In a 16x16 multiply, OR-ing the high and low bytes of the multiplier together gives that test: the loop terminates as soon as the highest set bit has been shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.
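The loop-count reduction described above can be modelled in high-level terms. This is a Python sketch of a plain shift-and-add multiply with the high-byte-zero check and the operand swap bolted on; it is not 6502 code, and the function name and structure are my own:

```python
def mul16_shift_add(multiplier: int, multiplicand: int) -> int:
    """16x16 shift-and-add multiply, modelled in Python.

    If the multiplier's high byte is zero, only 8 loop iterations
    are needed instead of 16; swapping operands first lets any pair
    with one operand below 256 benefit from the shorter loop.
    """
    # Swap so the smaller value (if there is one) is the multiplier.
    if multiplier >= 256 and multiplicand < 256:
        multiplier, multiplicand = multiplicand, multiplier

    # High byte zero -> only the low 8 bits can affect the result.
    iterations = 8 if multiplier < 256 else 16

    result = 0
    for _ in range(iterations):
        if multiplier & 1:       # lowest bit set: add shifted multiplicand
            result += multiplicand
        multiplier >>= 1         # shift multiplier right
        multiplicand <<= 1       # shift multiplicand left
    return result & 0xFFFFFFFF   # 32-bit product
```

On the 6502 the swap and the high-byte test each cost a handful of bytes and cycles up front, in exchange for halving the number of loop iterations whenever one operand is below 256.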
Good ideas. I wonder if the time to check for these cases is small enough to be a win across all inputs? I'd be interested to see. The OR-ing trick is only used by omult1.a I think, but is it worth it? Certainly if you know in advance that you are likely to be multiplying small numbers, these approaches will win.
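For comparison, the terminate-on-zero variant can be sketched the same way. Here the Python `while` test stands in for OR-ing the multiplier's high and low bytes and branching on zero; again this is a model, not 6502 code:

```python
def mul16_early_exit(multiplier: int, multiplicand: int) -> int:
    """Shift-and-add multiply that stops when the multiplier
    reaches zero, instead of counting a fixed 16 iterations.

    The loop falls through as soon as the highest set bit of the
    multiplier has been shifted out, so small multipliers finish
    in only a few iterations with no separate special-case check.
    """
    result = 0
    while multiplier != 0:       # on 6502: ORA high/low bytes, BEQ out
        if multiplier & 1:
            result += multiplicand
        multiplier >>= 1
        multiplicand <<= 1
    return result & 0xFFFFFFFF   # 32-bit product
```

The trade-off is the same one raised above: the per-iteration zero test adds a few cycles to every pass through the loop, so it only wins overall when multipliers are small often enough for the saved iterations to outweigh that overhead.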