fast wide BCD add

White Flame · Post by **White Flame** » Tue Jan 14, 2014 5:06 am

The version I showed is completely parallel, from a software perspective that assumes full-word addition is atomic. Basically it does this:

Code: Select all

 $00200180 ; a
+$00900090 ; b
==========
 $00b00210 ; a+b, raw binary addition

 $00100010 ; nybbles with detected carries, calculated by some fixed shifting & masking operations
 $00600060 ; multiplied by 6, this is the number to add to do the fixup

 $00b00210 ; raw binary a+b
+$00600060 ; fixup value
=========
 $01100270 ; the decimal-adjusted addition

This is a fixed number of steps, no matter how wide your registers are, and there are no branches to test. This should be faster than serial testing & adjustment in software in all 32+ bit cases (and likely in 16 bit too). The secondary addition performs all the fixup in a single operation.

EDIT: Yeah, this performs the fixup per nybble but cannot cascade the fixup's carry propagation through multiple nybbles, like in the $099 + $001 case. I could think more on that, but using this same strategy it would end up running successive iterations until no nybble overflow is detected. That would likely still be faster than always checking each nybble in series, but would be variable in time and adds some complexity to the situation. It should give you a bit of an example of how to do math in parallel, though.

BitWise · Post by **BitWise** » Tue Jan 14, 2014 10:46 am

Mike Colishaw (IBM) is a noted expert on BCD Math. He maintains a set of links here
http://speleotrove.com/decimal/

White Flame · Post by **White Flame** » Tue Jan 14, 2014 10:47 am

So what the heck, here's an untested C implementation of my idea, which fully ripples the half-carry when needed, that you should be able to drop right into your code to test heavily. It's actually only a few lines long, if you remove comments and combine the nybbleFlags calculation together.

Code: Select all

if (p & D_FLAG)
{
    temp = A + operand + c;

    /* Detect nybbles that have binary overflowed (16-18) */
    WORD nybbleFlags = (A ^ operand ^ temp) >> 4;

    /* Detect decimally overflowing nybbles (10-15) */
    nybbleFlags |= (((temp>>1)&temp)>>2) | (((temp>>2)&temp)>>1);

    /* Flags have been shifted to bit0 of each nybble; mask out all other bits */
    nybbleFlags &= 0x11111111;

    while(nybbleFlags != 0)
    {
       /* Convert each nybble from a 0/1 flag to the fixup number, and add it in */
       temp += nybbleFlags * 6;
   
       /* Test again to see if any nybbles are 10 ($a).  This is the only overflow that can ripple at this point. */
       nybbleFlags = (((temp>>2)&temp)>>1) & 0x11111111;
    }
}

This leaves the 64-bit result in temp, suitable for the rest of the function to extract carry & defer to the lda() call.

I didn't do any optimization in terms of reducing the number of shifts. The nybbleFlags bits wouldn't be at bit0 anymore and it would look a bit weirder. nybbleFlags might be slightly faster as a DWORD, too, since it's always in play with temp.

It's still all in parallel, so doing $09990999 + $00010001 should only ripple 3 times.

White Flame · Post by **White Flame** » Tue Jan 14, 2014 12:34 pm

I googled around some more and found this: https://groups.google.com/forum/#!topic ... 6EnT7wV1uA

He pre-adds 6's to each nybble, so that automatically ripples the nybble-wise additions, then backs off affected nybbles with subtraction which should not carry between nybbles. This eliminates any software loop.

barrym95838 · Post by **barrym95838** » Tue Jan 14, 2014 3:12 pm

Good stuff, White Flame!

I'm sure that getting rid of my huge trail of & operations should speed it up nicely, and I'll do my best to incorporate it. I still haven't compiled my source yet, due to some last-minute spec changes and holes in my shift and rotate functions, but I'll let you know how it goes, when the time comes.

Thanks,

Mike

barrym95838 · Post by **barrym95838** » Tue Jan 14, 2014 6:36 pm

@MichaelM: I forgot to thank you as well for your input. Unfortunately, your discussion with BigEd is too advanced for me to attempt to participate.

Mike

BigEd · Post by **BigEd** » Tue Jan 14, 2014 7:39 pm

Thanks for that link White Flame. I don't fully understand but it's an interesting idea to pre-add the 6's. I think one of the patents did that too.

Thanks also BitWise for your link - I think I've been there before but not recently and not exhaustively. I should explore the site.

I feel I should make some orientation points:
- it's looking like BCD is of limited use especially with only add and subtract
- a BCD adder will always be slower than a binary adder
- if the full-word add is close to being critical for speed then a 2-cycle BCD operation is a good idea
- my effort at verilog is not tested!

So my efforts to cook up a fast BCD adder are pretty much just an intellectual exercise: the original minimal 65Org16 doesn't bother with BCD.

@MichaelM: thanks for your detailed comments. I know I'd previously looked at your design and noted the set of constants for decimal adjust. You're quite right that serial machines, or nibble-serial machines, have an advantage here. (The Z80 is nibble-serial, but presumably for other reasons, as it has a decimal adjust instruction.) I see now why you have an extra cycle and indeed there's no harm in that. It's true that I haven't tackled subtraction. I think it's not a great extra speed cost because the conditional subtraction of 6 that's needed does not affect the carries. So, if there weren't hardware there to perform BCD addition, the subtraction could be a straight binary subtractor. It does seem likely that including subtraction will slow down the adder/subtractor, making a non-ripple architecture even more necessary.

Quote:

To address your conjecture regarding the simplification of BCD addition with increasing width...

Ah, no indeed, I did not mean to say that BCD addition becomes simpler. I meant to say that a narrow BCD adder has less of a problem than a wide one. The advanced adders are not likely to be much of a win for a narrow BCD adder, and are likely to be a bigger win for a wide one. My point was that the linked paper only considered an 8-bit wide adder, which may support different conclusions than considering a 32-bit wide adder.

I completely agree that BCD addition requires the rippling of adjusted carries: for example 44444445 + 55555555 demonstrates that. My construction connects up the nibble-units in some kind of carry chain: either a straightforward ripple or a carry-select approach.

Hope that's helped
Cheers
Ed

Quote:

I guess this thread is supposed to be about Mike's original question?

Regrettably, no! I'd intended to leave software discussions on the original thread and explore a hardware discussion here. No harm done - we just end up with an interleaved pair of conversations.

barrym95838 · Post by **barrym95838** » Wed Jan 15, 2014 9:06 am

BigEd wrote:

... Regrettably, no! I'd intended to leave software discussions on the original thread and explore a hardware discussion here. No harm done - we just end up with an interleaved pair of conversations.

Probably my fault, for not noticing that we were in 'Programmable Logic' instead of 'Programming' (or 'Simulation'). But, since the milk has already been spilled, I offer the following (with apologies):

Code: Select all

static void adc()
{
    DWORD
        temp = A + operand;
    if (p & D_FLAG)
    {
        DWORD         /* bias (A + operand) to detect any nybble carries */
            t2 = (temp + 0x66666666UL) ^ A ^ operand;
                            /* apply corrections to the affected nybbles */
        t2 = ((t2 >> 3) & 0x22222222UL) * 3 + temp;
                /* bias ((A + operand) + c) to detect any nybble carries */
        temp = (t2 + c + 0x66666666UL) ^ t2 ^ c;
                            /* apply corrections to the affected nybbles */
        temp = ((temp >> 3) & 0x22222222UL) * 3 + t2 + c;        
                                      /* detect tens-complement overflow */
        if ((A < 0x50000000UL) == (operand < 0x50000000UL)) &&
          (((WORD) temp < 0x50000000UL) != (operand < 0x50000000UL))
            setV();
        else
            clrV();
    }
    else
    {
        temp += c;
        if ((A & BIT_31) == (operand & BIT_31)) &&
          ((temp & BIT_31) != (operand & BIT_31))
            setV();
        else
            clrV();
    }
    operand = (WORD) temp;
    c = (WORD) (temp >> 32);
    lda();
}

This is untested, but I think that I can do a 32-bit + 32-bit + 32-bit BCD addition with the correct BCD answer, thanks to White Flame's links and hints.

Mike

[Edit: Fixed the parentheses in the overflow conditional expression.]

BigEd · Post by **BigEd** » Thu Jan 16, 2014 8:45 am

I've managed to understand it, and I like it! You do the addition with a bias of 6, use parity to determine the carries, then construct an adjustment using the carries as a mask. Or something very like that. If your carry in were a single bit, or even just one bit per nibble, I think you could do all the work in one pass. If the the carry is composed of decimal digits, you need two passes (as you have.)
Cheers
Ed

barrym95838 · Post by **barrym95838** » Thu Jan 16, 2014 9:10 am

You're probably going to find this amusing, but I studied the x86 version from White Flame's link, de-compiled it to C, figured out what it was doing, re-arranged it into 'proper' C, then modified it for my purposes. I couldn't figure out what the x86 code was doing until after I 'blindly' translated it to C.

Mike

BigEd · Post by **BigEd** » Thu Jan 16, 2014 7:12 pm

I never could get the hang of 8-series registers and their names - it gets worse as you move up to 32bit and 64bit x86... so I can see how C would make a lot more sense!

One idea I had was to mask out alternate nibbles, so you can get 8 bits of result for each nibble. Mask the inputs (or an intermediate result) with 0f0f0f0f and f0f0f0f0, do the work and then recombine. But it won't approach the efficiency of this method.

You are of course spoilt by having a 64 bit datatype to work in!

White Flame · Post by **White Flame** » Thu Jan 16, 2014 10:54 pm

BigEd wrote:

You are of course spoilt by having a 64 bit datatype to work in!

It still really annoys me that no high-level language (that I know of) exposes the carry flag for use in chaining together separate operations.

However, in a discussion on the Mill CPU (a very interesting new architecture), the question came up on what is carry actually used for on modern machines? The only results that came up were emulators (which is just legacy compatibility), bignum ops, and a boolean parameter in ABIs. Most of the use of carry in small CPUs is for optimization and tricks to get around the lack of register width or powerful instructions.

BigEd · Post by **BigEd** » Thu Jan 16, 2014 10:59 pm

Yes, bignums is the obvious one. After a subtraction or compare it tells you the result - it's a volatile but free boolean. In a machine with extreme register pressure like the 6502 that's highly valuable. Not sure if it's often used after an addition other than for carrying into a multi-word result, but of course it does tell you that an unsigned addition overflowed. Good point about ABIs too - same comment about register pressure. So a machine with adequate number and width of registers can do without. I believe a status register is a pain for aggressive microarchitectures.

GARTHWILSON · Post by **GARTHWILSON** » Fri Jan 17, 2014 1:35 am

White Flame wrote:

It still really annoys me that no high-level language (that I know of) exposes the carry flag for use in chaining together separate operations.

It would be pretty trivial to write Forth primitives that do it.

Forth lets you define and re-define even your own operators like + - etc..

BigEd · Post by **BigEd** » Fri Jan 17, 2014 9:57 am

White Flame wrote:

... the Mill CPU (a very interesting new architecture)

I've just posted about this at http://anycpu.org/forum/viewtopic.php?f=3&t=109 so we can discuss it there if we like.
Cheers
Ed

fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add

Re: fast wide BCD add