6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 01, 2024 1:28 am

All times are UTC




Post new topic Reply to topic  [ 30 posts ]  Go to page Previous  1, 2
Author Message
 Post subject: Re: fast wide BCD add
PostPosted: Tue Jan 14, 2014 5:06 am 
Offline

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
The version I showed is completely parallel, from a software perspective that assumes full-word addition is atomic. Basically it does this:

Code:
 $00200180 ; a
+$00900090 ; b
==========
 $00b00210 ; a+b, raw binary addition

 $00100010 ; nybbles with detected carries, calculated by some fixed shifting & masking operations
 $00600060 ; multiplied by 6, this is the number to add to do the fixup

 $00b00210 ; raw binary a+b
+$00600060 ; fixup value
=========
 $01100270 ; the decimal-adjusted addition


This is a fixed number of steps, no matter how wide your registers are, and there are no branches to test. This should be faster than serial testing & adjustment in software in all 32+ bit cases (and likely in 16 bit too). The secondary addition performs all the fixup in a single operation.




EDIT: Yeah, this performs the fixup per nybble but cannot cascade the fixup's carry propagation through multiple nybbles, like in the $099 + $001 case. I could think more on that, but using this same strategy it would end up running successive iterations until no nybble overflow is detected. That would likely still be faster than always checking each nybble in series, but would be variable in time and adds some complexity to the situation. It should give you a bit of an example of how to do math in parallel, though.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Tue Jan 14, 2014 10:46 am 
Offline
User avatar

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
Mike Colishaw (IBM) is a noted expert on BCD Math. He maintains a set of links here
http://speleotrove.com/decimal/

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Tue Jan 14, 2014 10:47 am 
Offline

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
So what the heck, here's an untested C implementation of my idea, which fully ripples the half-carry when needed, that you should be able to drop right into your code to test heavily. It's actually only a few lines long, if you remove comments and combine the nybbleFlags calculation together.

Code:
if (p & D_FLAG)
{
    temp = A + operand + c;

    /* Detect nybbles that have binary overflowed (16-18) */
    WORD nybbleFlags = (A ^ operand ^ temp) >> 4;

    /* Detect decimally overflowing nybbles (10-15) */
    nybbleFlags |= (((temp>>1)&temp)>>2) | (((temp>>2)&temp)>>1);

    /* Flags have been shifted to bit0 of each nybble; mask out all other bits */
    nybbleFlags &= 0x11111111;

    while(nybbleFlags != 0)
    {
       /* Convert each nybble from a 0/1 flag to the fixup number, and add it in */
       temp += nybbleFlags * 6;
   
       /* Test again to see if any nybbles are 10 ($a).  This is the only overflow that can ripple at this point. */
       nybbleFlags = (((temp>>2)&temp)>>1) & 0x11111111;
    }
}


This leaves the 64-bit result in temp, suitable for the rest of the function to extract carry & defer to the lda() call.


I didn't do any optimization in terms of reducing the number of shifts. The nybbleFlags bits wouldn't be at bit0 anymore and it would look a bit weirder. nybbleFlags might be slightly faster as a DWORD, too, since it's always in play with temp.

It's still all in parallel, so doing $09990999 + $00010001 should only ripple 3 times.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Tue Jan 14, 2014 12:34 pm 
Offline

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
I googled around some more and found this: https://groups.google.com/forum/#!topic/comp.lang.asm.x86/26EnT7wV1uA

He pre-adds 6's to each nybble, so that automatically ripples the nybble-wise additions, then backs off affected nybbles with subtraction which should not carry between nybbles. This eliminates any software loop.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Tue Jan 14, 2014 3:12 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
Good stuff, White Flame!

I'm sure that getting rid of my huge trail of & operations should speed it up nicely, and I'll do my best to incorporate it. I still haven't compiled my source yet, due to some last-minute spec changes and holes in my shift and rotate functions, but I'll let you know how it goes, when the time comes.

Thanks,

Mike


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Tue Jan 14, 2014 6:36 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
@MichaelM: I forgot to thank you as well for your input. Unfortunately, your discussion with BigEd is too advanced for me to attempt to participate.

Mike


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Tue Jan 14, 2014 7:39 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
Thanks for that link White Flame. I don't fully understand but it's an interesting idea to pre-add the 6's. I think one of the patents did that too.

Thanks also BitWise for your link - I think I've been there before but not recently and not exhaustively. I should explore the site.

I feel I should make some orientation points:
- it's looking like BCD is of limited use especially with only add and subtract
- a BCD adder will always be slower than a binary adder
- if the full-word add is close to being critical for speed then a 2-cycle BCD operation is a good idea
- my effort at verilog is not tested!

So my efforts to cook up a fast BCD adder are pretty much just an intellectual exercise: the original minimal 65Org16 doesn't bother with BCD.

@MichaelM: thanks for your detailed comments. I know I'd previously looked at your design and noted the set of constants for decimal adjust. You're quite right that serial machines, or nibble-serial machines, have an advantage here. (The Z80 is nibble-serial, but presumably for other reasons, as it has a decimal adjust instruction.) I see now why you have an extra cycle and indeed there's no harm in that. It's true that I haven't tackled subtraction. I think it's not a great extra speed cost because the conditional subtraction of 6 that's needed does not affect the carries. So, if there weren't hardware there to perform BCD addition, the subtraction could be a straight binary subtractor. It does seem likely that including subtraction will slow down the adder/subtractor, making a non-ripple architecture even more necessary.

Quote:
To address your conjecture regarding the simplification of BCD addition with increasing width...

Ah, no indeed, I did not mean to say that BCD addition becomes simpler. I meant to say that a narrow BCD adder has less of a problem than a wide one. The advanced adders are not likely to be much of a win for a narrow BCD adder, and are likely to be a bigger win for a wide one. My point was that the linked paper only considered an 8-bit wide adder, which may support different conclusions than considering a 32-bit wide adder.

I completely agree that BCD addition requires the rippling of adjusted carries: for example 44444445 + 55555555 demonstrates that. My construction connects up the nibble-units in some kind of carry chain: either a straightforward ripple or a carry-select approach.

Hope that's helped
Cheers
Ed

Quote:
I guess this thread is supposed to be about Mike's original question?

Regrettably, no! I'd intended to leave software discussions on the original thread and explore a hardware discussion here. No harm done - we just end up with an interleaved pair of conversations.


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Wed Jan 15, 2014 9:06 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
BigEd wrote:
... Regrettably, no! I'd intended to leave software discussions on the original thread and explore a hardware discussion here. No harm done - we just end up with an interleaved pair of conversations.


Probably my fault, for not noticing that we were in 'Programmable Logic' instead of 'Programming' (or 'Simulation'). But, since the milk has already been spilled, I offer the following (with apologies):

Code:
static void adc()
{
    DWORD
        temp = A + operand;
    if (p & D_FLAG)
    {
        DWORD         /* bias (A + operand) to detect any nybble carries */
            t2 = (temp + 0x66666666UL) ^ A ^ operand;
                            /* apply corrections to the affected nybbles */
        t2 = ((t2 >> 3) & 0x22222222UL) * 3 + temp;
                /* bias ((A + operand) + c) to detect any nybble carries */
        temp = (t2 + c + 0x66666666UL) ^ t2 ^ c;
                            /* apply corrections to the affected nybbles */
        temp = ((temp >> 3) & 0x22222222UL) * 3 + t2 + c;       
                                      /* detect tens-complement overflow */
        if ((A < 0x50000000UL) == (operand < 0x50000000UL)) &&
          (((WORD) temp < 0x50000000UL) != (operand < 0x50000000UL))
            setV();
        else
            clrV();
    }
    else
    {
        temp += c;
        if ((A & BIT_31) == (operand & BIT_31)) &&
          ((temp & BIT_31) != (operand & BIT_31))
            setV();
        else
            clrV();
    }
    operand = (WORD) temp;
    c = (WORD) (temp >> 32);
    lda();
}

This is untested, but I think that I can do a 32-bit + 32-bit + 32-bit BCD addition with the correct BCD answer, thanks to White Flame's links and hints.

Mike

[Edit: Fixed the parentheses in the overflow conditional expression.]


Last edited by barrym95838 on Thu Jan 16, 2014 9:01 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Thu Jan 16, 2014 8:45 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
I've managed to understand it, and I like it! You do the addition with a bias of 6, use parity to determine the carries, then construct an adjustment using the carries as a mask. Or something very like that. If your carry in were a single bit, or even just one bit per nibble, I think you could do all the work in one pass. If the the carry is composed of decimal digits, you need two passes (as you have.)
Cheers
Ed


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Thu Jan 16, 2014 9:10 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
You're probably going to find this amusing, but I studied the x86 version from White Flame's link, de-compiled it to C, figured out what it was doing, re-arranged it into 'proper' C, then modified it for my purposes. I couldn't figure out what the x86 code was doing until after I 'blindly' translated it to C.

Mike


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Thu Jan 16, 2014 7:12 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
I never could get the hang of 8-series registers and their names - it gets worse as you move up to 32bit and 64bit x86... so I can see how C would make a lot more sense!

One idea I had was to mask out alternate nibbles, so you can get 8 bits of result for each nibble. Mask the inputs (or an intermediate result) with 0f0f0f0f and f0f0f0f0, do the work and then recombine. But it won't approach the efficiency of this method.

You are of course spoilt by having a 64 bit datatype to work in!


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Thu Jan 16, 2014 10:54 pm 
Offline

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
BigEd wrote:
You are of course spoilt by having a 64 bit datatype to work in!


It still really annoys me that no high-level language (that I know of) exposes the carry flag for use in chaining together separate operations.

However, in a discussion on the Mill CPU (a very interesting new architecture), the question came up on what is carry actually used for on modern machines? The only results that came up were emulators (which is just legacy compatibility), bignum ops, and a boolean parameter in ABIs. Most of the use of carry in small CPUs is for optimization and tricks to get around the lack of register width or powerful instructions.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Thu Jan 16, 2014 10:59 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
Yes, bignums is the obvious one. After a subtraction or compare it tells you the result - it's a volatile but free boolean. In a machine with extreme register pressure like the 6502 that's highly valuable. Not sure if it's often used after an addition other than for carrying into a multi-word result, but of course it does tell you that an unsigned addition overflowed. Good point about ABIs too - same comment about register pressure. So a machine with adequate number and width of registers can do without. I believe a status register is a pain for aggressive microarchitectures.


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Fri Jan 17, 2014 1:35 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8534
Location: Southern California
White Flame wrote:
It still really annoys me that no high-level language (that I know of) exposes the carry flag for use in chaining together separate operations.

It would be pretty trivial to write Forth primitives that do it. :D Forth lets you define and re-define even your own operators like + - etc..

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
 Post subject: Re: fast wide BCD add
PostPosted: Fri Jan 17, 2014 9:57 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
White Flame wrote:
... the Mill CPU (a very interesting new architecture)

I've just posted about this at http://anycpu.org/forum/viewtopic.php?f=3&t=109 so we can discuss it there if we like.
Cheers
Ed


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 30 posts ]  Go to page Previous  1, 2

All times are UTC


Who is online

Users browsing this forum: No registered users and 21 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: