6502.org • View topic - fast wide BCD add

View unanswered posts | View active topics

Board index » 6502.org Users Forum » Programmable Logic

All times are UTC

fast wide BCD add

Page 1 of 2

[ 30 posts ]

Go to page 1, 2 Next

Print view

Previous topic | Next topic

Author

Message

BigEd

Post subject: fast wide BCD add

Posted: Sun Jan 12, 2014 7:59 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

As noted elsewhere, BCD addition is difficult to get right, and requires that each nibble produces a carry to the next.

For a wide machine, rippling the carry up 8 nibbles seems like it might limit the Max speed of the CPU.

I think that a carry-select approach should get over the speed problem and not be too difficult to code... But I'm opening this thread so we can discuss different approaches.

Cheers
Ed

Top

barrym95838

Post subject: Re: fast wide BCD add

Posted: Sun Jan 12, 2014 10:14 pm

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1927
Location: Sacramento, CA, USA

Thanks, Ed.

If you feel inclined to do so, could you elaborate a bit on this 'carry-select' approach, with examples, or links, or links to examples? You're very good at finding good links among the search engine noise.

Thanks,

Mike

Top

BigEd

Post subject: Re: fast wide BCD add

Posted: Sun Jan 12, 2014 10:37 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

I'll see if I can get this straight! The problem with carry is that even the MSB result depends on the carryin to the LSB addition, and you don't want to have to wait.

So, with carry-select, your basic tactic is to perform two additions, one with carryin true and the other with carryin false. It's a low-cost mux to choose between the two results, and the input of the mux is the actual carryin.

Let's say you're tackling a 32-bit binary addition. You split into top 16 bits and bottom 16 bits. The bottom 16 bits you add conventionally, which costs you 16 bits of carry propagation time (if you don't do anything clever.) The top 16 bits you compute both with and without carry, which again takes just 16 bits of propagation time, because in both cases the carryin is constant. You feed the two 17-bit results into a 17x2 mux, and the control input of the mux comes from the bottom half's carryout. As a consequence, you get the full 33-bit result just one mux delay after the 16-bit carry: the top half and bottom half additions took place in parallel. In area it cost you a little more than 3 times a 16-bit adder.

It turns out you can use the same trick to perform each of the 16-bit additions, if you need to: you use 3 times an 8-bit adder for each of the 16-bit additions, to get a 16-bit result in one mux delay after an 8-bit result.

Now your 3 times 16-bit adders has become 9 times 8-bit adders, but you get your full 33-bit result two mux delays after an 8-bit result.

In the case of BCD, we need to work on nibbles, and for each nibble we need to add the two inputs and conditionally adjust the result and the carry by adding 6.

But I think the same logic applies: if we can't wait for all 8 nibbles to propagate their carries up to the final carryout, we can divide and conquer. If we're lucky we just have to do 3 times 4-nibble additions, or maybe we need to do 9 times 2-nibble additions.

There are diagrams at http://en.wikipedia.org/wiki/Carry-select_adder ... it turns out that applying the technique recursively is called Conditional Sum Adder. I suspect I once knew that!

I did find a bunch of papers and even a patent or two: it turns out IBM do something terribly clever for their wide BCD arithmetic. But my suspicion is that our FPGAs are quite big enough for us not to need to be too clever about using minimal logic: we're more likely to be chasing speed.

Try searching for [carry select bcd patent] if you want to dig deeper - the topic of fast and small ALUs is very rich for ever more sophisticated approaches. Fortunately you only need something good enough for your purpose.

Hope this helps
Ed

Top

barrym95838

Post subject: Re: fast wide BCD add

Posted: Mon Jan 13, 2014 1:28 am

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1927
Location: Sacramento, CA, USA

Here's an excerpt from my first amateurish attempt. Even with the loop unrolled, it looks brutally slow. Note that I am experimenting with a 32-bit carry register in place of a 1-bit carry flag. WORD is defined as uint32_t and DWORD is uint64_t ...

Code:

static void adc()
{
    DWORD
        temp;
    if (p & D_FLAG)
    {                                   /* ripple the BCD nybble carries */
        temp = (A & NYB_0) + (operand & NYB_0) + (c & NYB_0);
        if (temp >= 0x0000000aUL)
            temp += 0x00000006UL;
        temp += (A & NYB_1) + (operand & NYB_1) + (c & NYB_1);
        if (temp >= 0x000000a0UL)
            temp += 0x00000060UL;
        temp += (A & NYB_2) + (operand & NYB_2) + (c & NYB_2);
        if (temp >= 0x00000a00UL)
            temp += 0x00000600UL;
        temp += (A & NYB_3) + (operand & NYB_3) + (c & NYB_3);
        if (temp >= 0x0000a000UL)
            temp += 0x00006000UL;
        temp += (A & NYB_4) + (operand & NYB_4) + (c & NYB_4);
        if (temp >= 0x000a0000UL)
            temp += 0x00060000UL;
        temp += (A & NYB_5) + (operand & NYB_5) + (c & NYB_5);
        if (temp >= 0x00a00000UL)
            temp += 0x00600000UL;
        temp += (A & NYB_6) + (operand & NYB_6) + (c & NYB_6);
        if (temp >= 0x0a000000UL)
            temp += 0x06000000UL;
        temp += (A & NYB_7) + (operand & NYB_7) + (c & NYB_7);
        if (temp >= 0xa0000000UL)
            temp += 0x60000000UL;
        if ((A < 0x50000000UL) == (operand < 0x50000000UL)) &&
          ((temp < 0x50000000UL) != (operand < 0x50000000UL))
            setV();
        else
            clrV();
    }
    else
    {
        temp = A + operand + c;
        if ((A & BIT_31) == (operand & BIT_31)) &&
          ((temp & BIT_31) != (operand & BIT_31))
            setV();
        else
            clrV();
    }
    operand = (WORD) temp;
    c = (WORD) (temp >> 32);
    lda();
}

If you promise to keep the snickering to a minimum, I'd love to hear your opinions about how I could improve its efficiency.

Mike

Top

BigEd

Post subject: Re: fast wide BCD add

Posted: Mon Jan 13, 2014 10:13 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

Ah, well... if you're writing an emulator, there's no gain in trying to parallelise the carry computation - I was thinking of hardware designs.
But of course it's still worth getting BCD right! And I'm pretty sure you have to compute each nibble result before you move onto the next one, moving from LSB to MSB.
And it looks like you do exactly that! Comments:
- I'm not sure about bringing c into anything other than the LSB calculation. Would it ever be a value other than 0 or 1?
- Computing V is a shot in the dark - I don't think there's any accepted spec of what V would mean, as the BCD values are not signed. Taking them to be 10's complement, which is I think what you did, might be as good a spec as any.
- Computing the output value of c as you do means that c needs to be more than 32 bits.

Your best bet, I think, is to cook up some test cases - especially for the C and V calculations.

Cheers
Ed

Top

MichaelM

Post subject: Re: fast wide BCD add

Posted: Mon Jan 13, 2014 2:19 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL

BigEd:

I think that what you have in mind for speeding the calculation of the wide BCD sums is not particularly well suited for FPGA implementation. I expect that all of that additional logic required by these advanced adder architectures will result in inefficiencies getting onto and off of the local carry chains which will override the theoretical benefits that accrue to these advanced adder architectures when mapped onto a typical LUT-based FPGA.

The carry select method operates as you described earlier. The issue is the duplication of the summers. Invariably, the binary sum will require two cycles to compute. In addition, the BCD adjustment phase will similarly require two cycles to perform its operation. All total, if a carry select architecture is applied to both phases of a standard BCD adder/subtracter, then an equivalent of 4 binary adders are required to carry out the BCD operation.

Another well known technique that relies on pipelining the carries is the carry save technique. Like the carry select approach, the adder is divided into segments whose ripple carry time is acceptable. Unlike the carry select technique, the carry out of each segment is pipelined into the next stage on the subsequent cycle.

Thus, either technique requires two addition stages. The carry select essentially duplicates the logic of the adder at the first stage, and the carry save requires a full adder on the first stage and a simplified adder in the second stage. Internal to each stage, the carry out is generated by ripple carry logic. The carry select technique has the advantage in total delay to the carry out of the most significant stage because it is calculate in the first stage and simply selected in the second stage. The carry out of the most significant stage in a carry save adder is generated as a ripple carry out of the most significant stage's simplified second level adder.

Joseph Cavanagh's classic computer science text book provides detailed explanations of these types of circuits. (Note: He also has put out a text book on Verilog, but although I have it in my library, I wouldn't recommend it. Most recently, he combined both books into a single book, and included a new section on floating point arithmetic.)

I found this paper, "Design and Performance Analysis of Various Adders using Verilog", in which Figure 15 summarizes the number of LUTs, Slices, and delays expected for the various adders. The paper also analyzes two types of adders which I had not heard about before, but one of them, the Carry Increment adder, resembles the solution I had in mind above for the carry save adder: a full function adder followed by a simplified adder. The simplified adder is an incrementer.

In any case, Table I summarizes the results in tabular form. The ripple carry adder (RCA) uses the lowest amount of FPGA resources, but is not the slowest. The fastest adder is the carry save adder (CSA), and the carry increment adder (CIA) coming in a close second. The carry look ahead (CLA), usually touted as the most complex and resource intensive, is actually not as much of a resource hog as the carry save adder. (Funny how some complex logic maps onto the LUTs of FPGAs in a manner that is simpler in unexpected ways.)

The paper supports the point that I made above regarding the mapping of adders onto the FPGA. My point is that the carry propagation logic in the FPGA is quite optimal over a broad range. Attempting to map different adder architectures onto the FPGA fabric is more likely to result in more complex solutions with lower performance. Table I in the reference paper supports this anecdotal conclusion/observation of mine. After all of the work required to implement a CSA, its performance is only marginally better for the test case than the performance given by the ripple carry adder, which maps naturally onto the FPGA logic. The CIA is a natural extension of both the CSA and RCA for the FPGA, and although it uses more resources than the RCA, it uses far less resources than the CSA adder. Its performance improvement falls about 1/3 of way between that of the RCA and the CSA.

_________________
Michael A.

Top

BigEd

Post subject: Re: fast wide BCD add

Posted: Mon Jan 13, 2014 2:39 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

Thanks for your thoughts... I don't feel convinced yet! But of course I agree that the specifics of FPGAs are such that general arguments may or may not apply.

You are surely correct that carry select uses more resources. Is it too many? That depends! Are other smaller approaches fast enough? That too depends!

Three points:
- Binary addition may well be best implemented using the native carry support. But is it not the case that BCD can't make use of that? At best, the native carry will be adjusted between each nibble.
- Narrow adders are less of a problem than wide ones. So the paper comparing various approaches which looks only at 8 bit wide adders may be missing advantages that appear only in the wider case.
- You mention multiple cycles, whereas I'm thinking about single-cycle BCD addition (and wondering length that cycle is.) If you can take multiple cycles then other solutions may well be applicable. But, are you really saying that you think 2 cycles is the minimum for binary carry-select, and four cycles for BCD? Why would there be any need for more than a single cycle?

The basic nibble operation for BCD add, as I see it, would be to compute the 4-bit sum A+B+carry, detect if that's decimal 10 or more and perform an additional add of constant 6 if it is. That needs a cascade of two 4-bit adders, where the input to the second is a logic function of the output of the first. The critical path to carry out is going to be a few gates longer than the path to a 4-bit carry out, and maybe a little less than the path through two 4-bit adders.

Does that sound right to you?

Cheers
Ed

Top

barrym95838

Post subject: Re: fast wide BCD add

Posted: Mon Jan 13, 2014 3:14 pm

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1927
Location: Sacramento, CA, USA

BigEd wrote:

- I'm not sure about bringing c into anything other than the LSB calculation. Would it ever be a value other than 0 or 1?

Like I said, it's an experiment, with some interesting (at least to me) possibilities.

Quote:

- Computing V is a shot in the dark - I don't think there's any accepted spec of what V would mean, as the BCD values are not signed. Taking them to be 10's complement, which is I think what you did, might be as good a spec as any.

Yes, that's how I envisioned it.

Quote:

- Computing the output value of c as you do means that c needs to be more than 32 bits.

I believe that carry-out can only have three possible results in the case of ADC with three 32-bit inputs: 0, 1, or 2. A wide carry makes more sense when used with MUL and DIV, and that's why a initially implemented it. Looking at my simulation code, it appears that it will only work properly if none of the carry-in nybbles are greater than 1, so that would have to be corrected with even more logic (or carefully explained to a potential programmer).

Mike

Top

BigEd

Post subject: Re: fast wide BCD add

Posted: Mon Jan 13, 2014 4:31 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

(Sorry, my point about c was my own confusion: it's temp which needs one extra bit of headroom, and indeed you already made temp a 64 bit datatype)

Top

BigEd

Post subject: Re: fast wide BCD add

Posted: Mon Jan 13, 2014 4:41 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

I rustled up some verilog to see what the effect is of doing BCD calculation in the obvious nibble-wise manner:

target device xc3s50a, package tq144, speed -5

without hierarchy
16-bit binary adder 263Hz synth result
32-bit binary adder 206Hz synth result

with hierarchy using nibble adders (no difference in speed)
16-bit binary adder 263Hz synth result 16 LUTs, 27 slices (25 at PAR)
32-bit binary adder 206Hz synth result 32 LUTs, 54 slices (49 at PAR)

with hierarchy using BCD nibble adders
16-bit BCD adder 91MHz synth result, 32 LUTs, 34 slices (35 at PAR)
32-bit BCD adder 50MHz synth result, 64 LUTs, 67 slices (65 at PAR)

As our 6502 designs only reach about 50MHz on spartan 3, we might not be on the critical path with this straightforward approach.

Target Device : xc6slx4-3-tqg144
16-bit binary, 218MHz
32-bit binary, 171MHz
16-bit BCD, 133MHz
32-bit BCD, 86MHz, 152 LUTs

Edit: with a 16+(16 or 16) carry select approach on the xilinx6, I get
32-bit BCD, 109MHz, 225 LUTs
which is about the expected 50% area cost, and a worthwhile speed gain.

Attachments:

jc2_top.v.txt [1.44 KiB]
Downloaded 113 times

Top

GARTHWILSON

Post subject: Re: fast wide BCD add

Posted: Mon Jan 13, 2014 6:34 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8428
Location: Southern California

Just a tad off topic here (and maybe someone would want to start another topic for it if there's more than just a quick answer), if we're talking about BCD add (and subtract) for 32-bitters, are there any plans for a BCD multiply like there are for binary multiply? It seems like it would be prohibitively expensive in FPGA resources. What applications are envisioned for a BCD add and subtract without BCD multiply (and possibly divide) in a 32-bitter that does have binary multiply (and possibly divide)? The good reason for BCD addition and subtraction in 8-bit 6502 HDL models would (I think) be to make sure existing software can run on it; but there is no software investment to protect yet with a 32-bit 65-family processor. I'm not being critical here, just puzzled and looking to get more understanding.

Related, what I put in the section of my website on large look-up tables for superfast, accurate scaled-integer math includes inversion since you can't have a look-up table for division but you can invert then multiply. The 1/X table of 65,536 entries uses 256KB because each entry gives a four-byte answer so you don't lose accuracy in the answer at either end of the range. That's not to say you always have to use the entire 32 bits though. Would inversion be a better use of resources on an FPGA than division, providing better overall performance too?

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

Top

BigEd

Post subject: Re: fast wide BCD add

Posted: Mon Jan 13, 2014 8:29 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

Hi Garth
hmm, you make me reconsider why BCD might be worthwhile. Mostly I suspect we're just aiming for 6502 compatibility. Addition and subtraction are not too difficult, and of course the 6502 stops there. For a point of sale system or a clock or video game decimal addition and subtraction has an attraction. I rather suspect (BCD) multiplication is horrendous. As you know, most FPGAs have hardware multipliers which makes (binary) multiplication an easy thing to add, and plausibly it's useful too: for array accesses, for graphical calculations, maybe for signal processing. There's no such shortcut to BCD multiplication.

As for 1/x, it might be a little easier than division, but my position is that a divide step instruction is a good tradeoff position to take. Of course if you built a system with lots of memory you could implement a table-driven approach. A hardware divider is quite a difficult thing to build (and to verify - look at the Pentium bug!)

Cheers
Ed

Top

White Flame

Post subject: Re: fast wide BCD add

Posted: Tue Jan 14, 2014 1:26 am

Joined: Tue Jul 24, 2012 2:27 am
Posts: 672

I guess this thread is supposed to be about Mike's original question?

barrym95838 wrote:

I think that there might be a bug in Mike Chambers' BCD adjustment inside adc():

Code:

...
        clearcarry();

        if ((a & 0x0F) > 0x09) {
            a += 0x06;
        }
        if ((a & 0xF0) > 0x90) {
            a += 0x60;
            setcarry();
        }
...

I believe that it always produces a BCD result, but ignores the half-carry, and will produce the wrong result for something as simple as 9+9 (if I'm reading the code properly, it looks like it will give an answer of $12, regardless of mode). I don't have my compiler handy at the moment, but I think that I'm right.

Yes, the code you listed is incorrect. It only checks for result nybbles in the 10-15 range for additional carry, missing the fixup required for resulting nybble additions in the 16-18 ($10-$12) range. In the latter case, the 1 will have been correctly carried up to the higher nybble, but the remaining lower nybble remains 0 to 2.

Plus, its output carry calculation is also wrong, missing the carry output for $90 + $90, due to the same reason.

barrym95838 wrote:

What do you guys think would be the most efficient method to include the half-carry, especially for an eight-digit version of the same function? I don't consider myself to be a C expert, but it appears to be a bit weak in its handling of carries and half-carries. I know that it can be done, just not very efficiently, at least be me.

You can perform the +6 fixup in software in nybble-parallel. Get a bitmask of each nybble that carried, say 0x00111001, multiply that by six, then add that to your result. However, just like the brokenness of the above code, you need to do this for every nybble that performed a carry, regardless of what the resulting nybble holds.

Checking for nybble values in the 10-15 range in parallel is pretty easy:
10 1010
11 1011
12 1100
13 1101
14 1110
15 1111

If bit3 & (bit2 | bit1), then we need to perform a carry. We can check these in parallel using ((num>>1) & num), and ((num>>2) & num), masking off the relevant resulting bits.

Checking for whether or not a nybble already carried up is also possible. Note that a bitwise add is an xor with carry fixup. So, xor the initial 2 values together, and compare (again via xor) the low bit of each nybble with the respective bit from the addition result. When that bit is different from the plain xor value, then a carry occurred there and the lower nybble needs to be fixed up.

The final output carry is an OR'ing of the original addition's carry, and the fixup addition's carry.

This is untested, but is along the right path:

Code:

int sum = a+b+carryIn;
int carryOut = ...whatever...;

// Detect & flag nybbles with values of 10-15 for fixup.
int nybbleOverflows = ((((sum >>1) & sum) >>2) | (((sum>>2) & sum) >>1) & 0x11111111;

// If a nybble had an input carry, the nybble below it needs fixup, hence the >>4
int halfCarries = ((a ^ b ^ sum) & 0x11111110) >> 4;

sum += (nybbleOverflows | halfCarries) * 6;
carryOut |= ...whatever...

Since the addition goes beyond the length of the machine word, calculating the output carry isn't straightforward. If you write this in assembly code, it becomes a lot easier to deal with that step. You can likely eliminate some shifting by keeping the flags in bit1 instead of bit0 of each nybble.

EDIT: Ah, I just noticed that you're doing 32-bit math on a 64-bit machine. Well, then keep calm and carry out!

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor

Top

barrym95838

Post subject: Re: fast wide BCD add

Posted: Tue Jan 14, 2014 4:08 am

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1927
Location: Sacramento, CA, USA

GARTHWILSON wrote:

... The good reason for BCD addition and subtraction in 8-bit 6502 HDL models would (I think) be to make sure existing software can run on it; but there is no software investment to protect yet with a 32-bit 65-family processor. I'm not being critical here, just puzzled and looking to get more understanding.

You make a good point, Garth. I am tempted to move on to the next (more important) issue with shift and rotate, and leave BCD alone until later.

@White Flame: Thanks for your ideas. I don't fully understand every detail of your explanation, but I'm definitely book-marking this thread for later reference.

Mike

Top

MichaelM

Post subject: Re: fast wide BCD add

Posted: Tue Jan 14, 2014 4:39 am

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL

BigEd:

The combinatorial path for the BCD adder from my FPGA core, when I tried to implement it as a single cycle operation, was too long and was a prime driver in reducing the performance achievable. I broke the combinatorial path with a single pipeline register, and allowed BCD addition/subtraction to require two memory cycles. (It is my understanding that 65C02 ADC/SBC instructions require two cycles when D is set in the processor status word. [p. 327 of Programmer's Reference Manual] Thus, this design decision is consistent with the characteristics of the original 65C02.)

The issue I found in getting my BCD adder to perform correctly in both addition and subtraction was that the fixup value each of the two nibbles were not just +6 when a nibble carry occurred. The following is a snippet from my brute force implementation of a BCD add/sub for my core:

Code:

//  Adder Second Stage - Combinatorial; BCD Digit Adjustment

always @(*)
begin
    // Generate Decimal Mode Digit Adjust Signals

    DA[1] <= ((rOp) ? ~rC7 | (DA[0] & MSN_GT9)
                    :  rC7 | MSN_GT9 | (DA[0] & ~rC3 & MSN_GT8));
    DA[0] <= ((rOp) ? ~rC3
                    :  rC3 | LSN_GT9);

    case(DA)
        2'b01   : Adj <= (rS + ((rOp) ? 8'hFA : 8'h06)); // ±06 BCD
        2'b10   : Adj <= (rS + ((rOp) ? 8'hA0 : 8'h60)); // ±60 BCD
        2'b11   : Adj <= (rS + ((rOp) ? 8'h9A : 8'h66)); // ±66 BCD
        default : Adj <= (rS + 8'h00);                   // 0
    endcase
end

The DA[1:0] variable determines what value should be applied to fixup the partial sum (rS), which was naturally computed as binary sum during the first stage of the adder. DA[1] is the decimal adjust flag/signal for the most significant nibble, and DA[0] is the decimal adjust flag/signal for the least significant nibble. rOp is the operation, and is a 1 for SBC instructions and 0 for ADC instructions.

Thus, for subtraction, the most significant nibble (MSN) requires adjustment if there is no carry out of the MSN, or an adjustment is required of the least significant nibble (LSN) and the MSN is greater than 8. For addition, the MSN requires adjustment if there is a nibble carry from the MSN, or the MSN > 9, or the LSN requires adjusment when no carry from the LSN and the MSN is greater than 8.

The equation for decimal adjustment of the LSN is less complicated. Decimal adjustment of the LSN is required when subtracting if there is no carry from the LSN, and when adding, decimal adjustment of the LSN is required if there is a carry out of the LSN or the LSN is greater than 9.

After I determine the MSN and LSN decimal adjustment signals, DA[1:0], I adjust the partial sum using eight values that I lookup from a small ROM based on the operation being performed (add or sub) and the computed nibble decimal adjust flags. The values that I apply to the partial sum to get the correct BCD result are: 00, 06, 60, and 66 for addition; 00, FA (-6 in 10's complement), A0 (-60), and 9A (-66).

Like you, I set up a dedicated test harness with which to synthesize the M65C02_BCD.v module in order to measure the theoretical performance: all module inputs are registered to eliminate the input delays from IOBs, and all module outputs are also registered to eliminate the output delays of the IOBs. In this manner, the reported performance measures the best theoretical performance that the synthesizer and PAR tools can achieve without any placement or period constraints.

For the M65C02_BCD module, using an XC3S200A-4 part, the Xilinx tools report that the best clock period that can be achieved is 7.287 ns. The resource utilization reported is 15 registers, 50 LUTs, and 33 slices. None of these registers, LUTs, or slices are part of the test harness in which I placed the module.

In essence, the second stage is implemented as a carry look ahead adder. The equation that I derived for DA[1] is coupled to DA[0]. Without having derived the recurrence relation/equation for the decimal adjust flag of the next nibble, I would venture a guess that it will be a function of the DA flag of all the previous nibbles. This may explain why Intel did not include a flow through BCD adjustment cycle in their 8080/8085 processors, and why they have two BCD adjust instructions which use the nibble carry in the PSW and some information regarding whether an addition or subtraction was performed.

To address your conjecture regarding the simplification of BCD addition with increasing width, I think that the preceding explanation, which shows that the decimal adjustment flag for the next significant nibble is a function of the decimal adjustment of the preceding nibble, should put those thoughts to bed. Like high precision binary addition/subtraction, adjustment of higher order nibbles will be dependent on the rippling of the adjustment flags in a manner analogous to ripple carry for simple binary addition.

In my investigations/studies of computer architectures of the 50s, 60s, and 70s, I've come across a couple of architectures that were BCD based. In some cases, I found that the ALUs in these computers operated in some kind of serial fashion. The Burroughs B2500/3500/4500 family strictly used BCD for both addresses and arithmetic. Interestingly, they supported BCD arithmetic with as many a 100 digits. They are able to do this because they built their ALU to operate in a nibble/digit serial fashion. Operating in this manner, the cascading of the carry/BCD adjustment is naturally pipelined, and does not require large and complex equations such as the ones I derived for my parallel BCD adder.

That last statement leads me to propose that if someone wishes to simulate the operation of a 16 or 32 bit BCD adder in C (or other language), then adopting a serial approach to the computation and adjustment of each nibble may be easier to implement, and ultimately faster, than a parallel addition approach with a bunch of nibble carry registers/variables. I think this may be a better approach for barrym95838 to use to simulate a 32-bit ADC/SBC operation.

_________________
Michael A.

Last edited by MichaelM on Sun Mar 31, 2019 2:16 pm, edited 1 time in total.

Top

Page 1 of 2

[ 30 posts ]

Go to page 1, 2 Next

Board index » 6502.org Users Forum » Programmable Logic

All times are UTC

Who is online

Users browsing this forum: No registered users and 9 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum