Stacking Multiplication and Division similar to ADC/SBC

Proxy · Post by **Proxy** » Wed Jun 15, 2022 6:51 pm

Alright so, a few days ago i designed a 16-bit Hardware Multiplier and Divider that just barely fits into a single ATF1508.
The idea behind it was to have something similar to the 65SPI core, a custom peripheral IC that you program into a CPLD, in this case a very basic interger Math Co-Processor.

But the main issue is, they are only 16 bit operations... so how do you extend that to 32 or 64 bits?
With Addition and Subtraction it's easy, you just stack multiple ADD or SUB Instructions (ADC/SBC in case of the 65xx CPUs) in series from the lowest to the highest byte of the input numbers, cascading the carry/borrow through them all.
But for Multiply and Divide, i got nothing...
A more "formal" way to describe the issue would be:

Code: Select all

Given x-Byte wide MUL and DIV Operations and y-Byte wide Input values, what algorithm or function can be used to calculate the y*2-Byte wide result for MUL, and y-Byte wide Result and Remainder for DIV?
(assuming y > x)

To be clear i'm not really looking for actual code snippets, more just dumbed down descriptions, pseudocode, or similar.
As I would like to actually understand how it works so i could apply that knowledge to other CPUs in the future.

hmm, i wonder if there is a certain set of x and y values where using a pure software routine would be faster than using hardware MUL/DIV, or if the hardware version would always be faster regardless of how small they are compared to the input values.

drogon · Post by **drogon** » Wed Jun 15, 2022 7:53 pm

You may find this thread of interest:

viewtopic.php?f=2&t=6838

There is code there. One version where I implemented a 32x32 multiply using 128KB tables held in RAM was pretty fast.. But "stacking" a 16x16 to give 32x32 is relatively straightforward - this is for the case of a "hardware" (ie. lookup table) for an 8c8 multiply to be extended to a 32x32 multiply:

Code: Select all

; 32-32 multiply of
;               P  Q  R  S
;             x T  U  V  W

Code: Select all

; First round: W * PQRS

        doMul8  digW,digS,regW+0
        doMul8b      digR,regW+1
        doMul8b      digQ,regW+2
        doMul8b      digP,regW+3

; Second round: V * SRQ

        doMul8  digV,digS,regW+1
        doMul8b      digR,regW+2
        doMul8b      digQ,regW+3

; Third round: U * SR

        doMul8  digU,digS,regW+2
        doMul8b      digR,regW+3

; fourth round: T * S

        doMul8  digT,digS,regW+3

Now, those are macros, but they're multiply first by 2nd and add the result in 3rd.

this post: viewtopic.php?f=2&t=6838&start=30#p88139 has that actual code, etc.

What I found (and was surprised at!) was that we could implement a multiply on the 6502/'816 that was faster than me asking an ATmega to do it... (the Atmega has a hardware 8x8 multiplier too)

-Gordon

BB8 · Post by **BB8** » Thu Jun 16, 2022 8:16 am

drogon's algo describes a 32x32=32 multiply, while the OP is looking for a 32x32=64 one. So you need to compute all the multiplications:

first round W*PQRS
second round V*PQRS
third round U*PQRS
fourth round T*PQRS

and add them shifted correctly

Proxy · Post by **Proxy** » Fri Jun 17, 2022 11:37 am

BB8 wrote:

drogon's algo describes a 32x32=32 multiply, while the OP is looking for a 32x32=64 one. So you need to compute all the multiplications:

first round W*PQRS
second round V*PQRS
third round U*PQRS
fourth round T*PQRS

and add them shifted correctly

so you just split the input numbers into "x" sized chunks, multiply each part of the first number with each part of the second number, and then add them together somehow.
but could you be more specific about how you shift them "correctly" before adding them to the final result? does the amount you shift depend on the width of the MUL operation or the input values?

from what i can see in drogon's code, each round adds (without carry) to the final result and also each round gets shifted by one additional byte in the result. so each addition to the result overlaps by 1 byte...
except the last operation writes outside of the regW vairable by 1 byte (16 bit STA to regW+3 writes to regW+3 and regW+4), is that intentional or am i mis-reading the code?

i tried to make a small diagram to better understand what is happening, is this correct for drogon's code?
(i used https://app.diagrams.net/ in case anyone was wondering. it's very useful to just quickly throw together something and have it looking "good enough")

drogon · Post by **drogon** » Fri Jun 17, 2022 12:03 pm

Proxy wrote:

BB8 wrote:

drogon's algo describes a 32x32=32 multiply, while the OP is looking for a 32x32=64 one. So you need to compute all the multiplications:

first round W*PQRS
second round V*PQRS
third round U*PQRS
fourth round T*PQRS

and add them shifted correctly

so you just split the input numbers into "x" sized chunks, multiply each part of the first number with each part of the second number, and then add them together somehow.
but could you be more specific about how you shift them "correctly" before adding them to the final result? does the amount you shift depend on the width of the MUL operation or the input values?

from what i can see in drogon's code, each round adds (without carry) to the final result and also each round gets shifted by one additional byte in the result. so each addition to the result overlaps by 1 byte...
except the last operation writes outside of the regW vairable by 1 byte (16 bit STA to regW+3 writes to regW+3 and regW+4), is that intentional or am i mis-reading the code?

regW here is 64-bits wide - that code is "optimised" for a 32x32 multiply to give a 32-bit result, so it's truncating the last (overflow/carry) byte.

The way it works is the same as my school-days long multiplication, so there it was Hundreds, Tens and Units, Here it's er. bytes. (as I have a "hardware" 8x8 lookup table multiplier)

Code: Select all

; 32-32 multiply of
;               P  Q  R  S
;             x T  U  V  W

PQRS and TUVW are 32-bit hex numbers - with each 'digit' being $00 through $FF, so it might be: $12345678 where P = $12, Q = $34, R = $56 and S = $78. (ie. each letter is a byte)

Here I start with 'regW' being all zeros. Multiply P Q R S with W and put the result into regW, so

regW[0] = W * S
regW[1] = W * R + carry
regW[2] = W * Q + carry
regW[3] = W * P + carry

The next line, er move up a place, like putting a zero down under the S,W column, here it's a byte to the left.

regW[1] = regW[1] + V * S
regW[2] = regW[2] + V * R + carry
regW[3] = regW[3] + V * Q + carry

At this point, in that code, I abandon the final step as it doesn't contribute to a 32-bit result, but if I needed the 64-bit result, I'd do the last stage:

regW[4] = regW[4] + V * P + carry (noting that the add would be redundant, but I use a macro and I zero'd regW to start with)

And so on.

This: https://www.mathsisfun.com/numbers/mult ... -long.html has a sort of animated version of decimal long multiplication, rather then decimal digits 0-9 we're using hex numbers 0-255.

I think your diagram is on it's side.

-Gordon

Proxy · Post by **Proxy** » Fri Jun 17, 2022 1:51 pm

drogon wrote:

regW here is 64-bits wide - that code is "optimised" for a 32x32 multiply to give a 32-bit result, so it's truncating the last (overflow/carry) byte.

oh i see, i saw that in the code but the line was commented out so i thought it could still lead to out of bounds memory corruption.

drogon wrote:

The way it works is the same as my school-days long multiplication, so there it was Hundreds, Tens and Units, Here it's er. bytes. (as I have a "hardware" 8x8 lookup table multiplier)

Code: Select all

; 32-32 multiply of
;               P  Q  R  S
;             x T  U  V  W

PQRS and TUVW are 32-bit hex numbers - with each 'digit' being $00 through $FF, so it might be: $12345678 where P = $12, Q = $34, R = $56 and S = $78. (ie. each letter is a byte)

Here I start with 'regW' being all zeros. Multiply P Q R S with W and put the result into regW, so

regW[0] = W * S
regW[1] = W * R + carry
regW[2] = W * Q + carry
regW[3] = W * P + carry

The next line, er move up a place, like putting a zero down under the S,W column, here it's a byte to the left.

regW[1] = regW[1] + V * S
regW[2] = regW[2] + V * R + carry
regW[3] = regW[3] + V * Q + carry

At this point, in that code, I abandon the final step as it doesn't contribute to a 32-bit result, but if I needed the 64-bit result, I'd do the last stage:

regW[4] = regW[4] + V * P + carry (noting that the add would be redundant, but I use a macro and I zero'd regW to start with)

And so on.

This: https://www.mathsisfun.com/numbers/mult ... -long.html has a sort of animated version of decimal long multiplication, rather then decimal digits 0-9 we're using hex numbers 0-255.

honestly i haven't done long Multiplication in such a loooong time, that link was actually helpful in getting a better understanding of it!

drogon wrote:

I think your diagram is on it's side.

no it's right side up, it's supposed to be regular addition with the line and plus sign at the bottom, each square represents a byte, so the 1x2 rectangles are the 16-bit results from the 8x8 multiplication of the 2 values inside of them. and the 1x4 rectangle at the bottom is the 32-bit output.
but it was still wrong, so i adjusted it and also added some "byte lines" and specified the "rounds" to hopefully make it a bit clearer.

so i assume a full 64-bit result version would look like this (ie the same but without the truncated multiplication "rounds"):

drogon · Post by **drogon** » Fri Jun 17, 2022 5:55 pm

Proxy wrote:

drogon wrote:

I think your diagram is on it's side.

no it's right side up, it's supposed to be regular addition with the line and plus sign at the bottom, each square represents a byte, so the 1x2 rectangles are the 16-bit results from the 8x8 multiplication of the 2 values inside of them. and the 1x4 rectangle at the bottom is the 32-bit output.
but it was still wrong, so i adjusted it and also added some "byte lines" and specified the "rounds" to hopefully make it a bit clearer.

chrome_lY3ddXpjgY.png

so i assume a full 64-bit result version would look like this (ie the same but without the truncated multiplication "rounds"):

chrome_2crI0yCUfQ.png

Ah right - that's it. I sum as we go along while 'traditionally' you'd do all the sums at the end.

In my case I'm actually doing an unsigned multiply - I check both numbers and make them both positive at the start, then adjust the sign at the end, so it's really a 31x31-bit multiply returning a 63-bit result.

I've also no idea if this is the fastest way, but it beat all the other methods I tried in that thread I posted earlier - and if I had 1MB of RAM in my Ruby816 board I'd probably have kept using it, but as it only has 512KB, sacrificing 128KB was a bit much as left little space for the BCPL compiler to run ... but a hardware 8x8 multiplier & divider might be nice...

-Gordon

Proxy · Post by **Proxy** » Fri Jun 17, 2022 8:12 pm

drogon wrote:

Ah right - that's it. I sum as we go along while 'traditionally' you'd do all the sums at the end.

In my case I'm actually doing an unsigned multiply - I check both numbers and make them both positive at the start, then adjust the sign at the end, so it's really a 31x31-bit multiply returning a 63-bit result.

I've also no idea if this is the fastest way, but it beat all the other methods I tried in that thread I posted earlier - and if I had 1MB of RAM in my Ruby816 board I'd probably have kept using it, but as it only has 512KB, sacrificing 128KB was a bit much as left little space for the BCPL compiler to run ... but a hardware 8x8 multiplier & divider might be nice...

-Gordon

yea i'm also just aiming for the unsigned variant, as you said signed multiplication is just a wrapper around unsigned multiplication.
and speaking of lookup tables, i have been thinking about how cheap it would be to use 2x SST39SF0x0 FLASH ICs to hold lookup tables for some math operations and then use a few octal D-Flip Flops to access them (and select between multiple "operations" like 8x8=16 MUL, 8/8=8r8 DIV, maybe even 16->16 SIN, COS, TAN, etc) so the whole thing only takes up around 5 Bytes in memory instead of 256kB-512kB. making it also useful for a 65c02 and avoiding the need to manually calculate the values at runtime.

but that's a topic for some other thread, now that i relearned long multiplication, division is next.
i assume this is also just school long division but done with hex numbers. so i have to relearn that as well.

BigEd · Post by **BigEd** » Fri Jun 17, 2022 8:53 pm

As far as I can tell, long division is usually done as binary - it's rather difficult to do it a nibble or a byte at a time. In other words, things are shifted by a single bit each time around a loop, and operations are word-wide not byte-wide.

In fact multi-byte or multi-word division is often done by an iterative algorithm which uses multiplication, because it's much easier to make a fast multiply, and once you have one (software or hardware) it's efficient to make use of it.

Maybe see https://en.wikipedia.org/wiki/Division_algorithm

GARTHWILSON · Post by **GARTHWILSON** » Fri Jun 17, 2022 9:17 pm

BigEd wrote:

As far as I can tell, long division is usually done as binary - it's rather difficult to do it a nibble or a byte at a time. In other words, things are shifted by a single bit each time around a loop, and operations are word-wide not byte-wide.

In 1986 or '7, when I was very green at this, I wrote some decimal floating-point routines for seven digits plus exponent, and brute-forced it. I handled a nybble at a time, and subtracted repeatedly until it underflowed, reduced the number of subtractions by 1 for that decimal digit's part of the answer, and moved on to the next nybble. It worked fine, but we didn't have particularly harsh performance requirements.

Proxy · Post by **Proxy** » Fri Jun 17, 2022 9:27 pm

BigEd wrote:

As far as I can tell, long division is usually done as binary - it's rather difficult to do it a nibble or a byte at a time. In other words, things are shifted by a single bit each time around a loop, and operations are word-wide not byte-wide.

In fact multi-byte or multi-word division is often done by an iterative algorithm which uses multiplication, because it's much easier to make a fast multiply, and once you have one (software or hardware) it's efficient to make use of it.

Maybe see https://en.wikipedia.org/wiki/Division_algorithm

I used that wikipedia page to implement the hardware division circuit i'm using.
specifically the "Integer division (unsigned) with remainder" part.
and because it's base 2 you don't even need multiplication as shifting once to the left is functionally identical to multiplying by 2. or do you meant some specific binary algorithm that actually needs multiplication to work?

but ultimately does that mean there is no feasible way of extending an DIV operation for larger numbers? a bit disappointing but division has always been a bit of an outlier.
oh well that means 32/32 and 64/64 division have to be done in software while 16/16 and 8/8 can be done blazingly fast in hardware.

Aaendi · Post by **Aaendi** » Sat Jun 18, 2022 2:32 am

drogon wrote:

You may find this thread of interest:

viewtopic.php?f=2&t=6838

There is code there. One version where I implemented a 32x32 multiply using 128KB tables held in RAM was pretty fast.. But "stacking" a 16x16 to give 32x32 is relatively straightforward - this is for the case of a "hardware" (ie. lookup table) for an 8c8 multiply to be extended to a 32x32 multiply:

Code: Select all

; 32-32 multiply of
;               P  Q  R  S
;             x T  U  V  W

Code: Select all

; First round: W * PQRS

        doMul8  digW,digS,regW+0
        doMul8b      digR,regW+1
        doMul8b      digQ,regW+2
        doMul8b      digP,regW+3

; Second round: V * SRQ

        doMul8  digV,digS,regW+1
        doMul8b      digR,regW+2
        doMul8b      digQ,regW+3

; Third round: U * SR

        doMul8  digU,digS,regW+2
        doMul8b      digR,regW+3

; fourth round: T * S

        doMul8  digT,digS,regW+3

Now, those are macros, but they're multiply first by 2nd and add the result in 3rd.

this post: viewtopic.php?f=2&t=6838&start=30#p88139 has that actual code, etc.

What I found (and was surprised at!) was that we could implement a multiply on the 6502/'816 that was faster than me asking an ATmega to do it... (the Atmega has a hardware 8x8 multiplier too)

-Gordon

After each 16-bit add, don't you need to carry to the next word?

Proxy · Post by **Proxy** » Sat Jun 18, 2022 7:19 am

nope, because the ADDs overlap on the upper byte of eachother, which effectly acts like a 8-bit wide carry, and the muliplication never sets the carry bit because the result is only ever 16-bits wide.

BigEd · Post by **BigEd** » Sat Jun 18, 2022 7:45 am

Proxy wrote:

do you meant some specific binary algorithm that actually needs multiplication to work?

Yes, see here, there's Newton-Raphson and there's Goldschmidt. (Probably others too)

Quote:

but ultimately does that mean there is no feasible way of extending an DIV operation for larger numbers?

That's my belief, but I don't have a proof (or a citation)

GARTHWILSON wrote:

BigEd wrote:

As far as I can tell, long division is usually done as binary - it's rather difficult to do it a nibble or a byte at a time. In other words, things are shifted by a single bit each time around a loop, and operations are word-wide not byte-wide.

In 1986 or '7, when I was very green at this, I wrote some decimal floating-point routines for seven digits plus exponent, and brute-forced it. I handled a nybble at a time, and subtracted repeatedly until it underflowed, reduced the number of subtractions by 1 for that decimal digit's part of the answer, and moved on to the next nybble. It worked fine, but we didn't have particularly harsh performance requirements.

Right, good point. If you're working in decimal, the story is slightly different. But as you note, it's difficult to be fast. A fundamental step in the long division story for humans is to estimate each digit, then multiply up and subtract. It's that estimation which is difficult to speed up - and that's the cause of Intel's FDIV bug which cost them half a billion in money and who knows what in reputation.

By doing repeated subtraction, you're not estimating in that sense, but you are using up time which is costing speed. (As you point out.)

Edit: in binary, to work a byte at a time, that's base 256 and you might need to subtract 255 times. But to work a nibble at a time is only base 16, and 15 times around the loop mightn't be too bad... except, to do the usual bit-at-a-time takes only 4 iterations to process each nibble, so that would usually be the way to go.

drogon · Post by **drogon** » Sat Jun 18, 2022 4:31 pm

Proxy wrote:

I used that wikipedia page to implement the hardware division circuit i'm using.
specifically the "Integer division (unsigned) with remainder" part.
and because it's base 2 you don't even need multiplication as shifting once to the left is functionally identical to multiplying by 2. or do you meant some specific binary algorithm that actually needs multiplication to work?

but ultimately does that mean there is no feasible way of extending an DIV operation for larger numbers? a bit disappointing but division has always been a bit of an outlier.
oh well that means 32/32 and 64/64 division have to be done in software while 16/16 and 8/8 can be done blazingly fast in hardware.

One persons software is another persons hardware - so given sufficient gates you can do anything... e.g. in the Foenix 816 system there is an FPGA that can do a 32-bit multiply and 32-bit divide in a single cycle (it may be more than one clock cycle, I don't have the exact details, but it can do the operation as fast as you can write the data to the input registers and read it from the output registers...

For my 32-bit divide I did it the "hard" way -

Code: Select all

;********************************************************************************
; DIV
;       A := B / A
; (B is transferred to Q in the code below as the calling stuff needs to preserve B)
; 32-bit 'registers' in Zero/direct page.
;********************************************************************************

... later on:

        .repeat 32
          divStep
        .endrep

... and

.macro  divStep
        .local  div2

        asl     regQ+0          ; Shift high bit of B into remainder
        rol     regQ+2
        rol     regT+0
        rol     regT+2

        lda     regT+0
        sec
        sbc     regA+0
        tay
        lda     regT+2
        sbc     regA+2
        bcc     div2
        sta     regT+2
        sty     regT+0
        inc     regQ+0
div2:
.endmacro

That's just bits extracted from the whole thing though, but you might get the idea..

So for 64-bit's you'd repeat that code 64 times rather than 32 times with the 'registers' & offsets adjusted accordingly.

-Gordon

Stacking Multiplication and Division similar to ADC/SBC

Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC

Re: Stacking Multiplication and Division similar to ADC/SBC