BigEd:
The combinatorial path for the BCD adder in my FPGA core, when I tried to implement it as a single-cycle operation, was too long and was a prime driver in reducing the achievable performance. I broke the combinatorial path with a single pipeline register, and allowed BCD addition/subtraction to take two memory cycles. (It is my understanding that 65C02 ADC/SBC instructions require an extra cycle when D is set in the processor status word. [p. 327 of Programmer's Reference Manual] Thus, this design decision is consistent with the characteristics of the original 65C02.)
The issue I found in getting my BCD adder to perform correctly for both addition and subtraction was that the fixup value for each of the two nibbles was not just +6 when a nibble carry occurred. The following is a snippet from my brute force implementation of a BCD add/sub for my core:
Code:
// Adder Second Stage - Combinatorial; BCD Digit Adjustment
always @(*)
begin
    // Generate Decimal Mode Digit Adjust Signals
    // (blocking assignments; DA[0] is computed first because DA[1] uses it)
    DA[0] = ((rOp) ? ~rC3
                   : rC3 | LSN_GT9);
    DA[1] = ((rOp) ? ~rC7 | (DA[0] & MSN_GT9)
                   : rC7 | MSN_GT9 | (DA[0] & ~rC3 & MSN_GT8));
    case(DA)
        2'b01   : Adj = (rS + ((rOp) ? 8'hFA : 8'h06));   // ±06 BCD
        2'b10   : Adj = (rS + ((rOp) ? 8'hA0 : 8'h60));   // ±60 BCD
        2'b11   : Adj = (rS + ((rOp) ? 8'h9A : 8'h66));   // ±66 BCD
        default : Adj = (rS + 8'h00);                     // 0
    endcase
end
The DA[1:0] variable determines what value should be applied to fix up the partial sum (rS), which was computed as a plain binary sum during the first stage of the adder. DA[1] is the decimal adjust flag/signal for the most significant nibble, and DA[0] is the decimal adjust flag/signal for the least significant nibble. rOp is the operation, and is 1 for SBC instructions and 0 for ADC instructions.
Thus, for subtraction, the most significant nibble (MSN) requires adjustment if there is no carry out of the MSN, or if the least significant nibble (LSN) requires adjustment and the MSN is greater than 9 (MSN_GT9 in the code). For addition, the MSN requires adjustment if there is a carry out of the MSN, or the MSN is greater than 9, or the LSN requires adjustment without a carry out of the LSN and the MSN is greater than 8.
The equation for decimal adjustment of the LSN is less complicated. Decimal adjustment of the LSN is required when subtracting if there is no carry from the LSN, and when adding, decimal adjustment of the LSN is required if there is a carry out of the LSN or the LSN is greater than 9.
After I determine the MSN and LSN decimal adjustment signals, DA[1:0], I adjust the partial sum using one of eight values that I look up from a small ROM based on the operation being performed (add or sub) and the computed nibble decimal adjust flags. The values that I apply to the partial sum to get the correct BCD result are 00, 06, 60, and 66 for addition, and 00, FA (-6 in 8-bit two's complement), A0 (-60), and 9A (-66) for subtraction.
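For anyone who wants to experiment with these equations outside of an FPGA, the following is a rough C model of the two stages described above. It is only a sketch: the function name bcd_addsub_model is made up, and the definitions of rC3, rC7, LSN_GT9, MSN_GT9, and MSN_GT8 are assumptions inferred from the signal names (they are not shown in the snippet), as is the assumption that the first stage adds the one's complement of the operand for SBC.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a two-stage BCD add/subtract.
 * op = 0 for ADC, 1 for SBC; ci is the carry input (for SBC, 1 = no borrow).
 * ASSUMPTION: rC3, rC7, LSN_GT9, MSN_GT9 and MSN_GT8 are defined here from
 * their names; the real module may derive them differently. */
uint8_t bcd_addsub_model(uint8_t a, uint8_t b, unsigned ci, unsigned op)
{
    assert(ci <= 1u && op <= 1u);

    /* First stage: plain binary sum; SBC adds the one's complement of b. */
    unsigned bx  = op ? (~b & 0xFFu) : b;
    unsigned lo  = (a & 0x0Fu) + (bx & 0x0Fu) + ci;
    unsigned sum = a + bx + ci;
    unsigned rC3 = lo > 0x0Fu;           /* carry out of the low nibble  */
    unsigned rC7 = sum > 0xFFu;          /* carry out of the high nibble */
    unsigned rS  = sum & 0xFFu;          /* 8-bit partial sum            */

    unsigned lsn = rS & 0x0Fu, msn = rS >> 4;
    unsigned LSN_GT9 = lsn > 9, MSN_GT9 = msn > 9, MSN_GT8 = msn > 8;

    /* Second stage: digit adjust flags, same equations as the Verilog. */
    unsigned da0 = op ? !rC3 : (rC3 | LSN_GT9);
    unsigned da1 = op ? (!rC7 | (da0 & MSN_GT9))
                      : (rC7 | MSN_GT9 | (da0 & !rC3 & MSN_GT8));

    /* Adjustment values indexed by {op, da1, da0}. */
    static const uint8_t adj[8] = { 0x00, 0x06, 0x60, 0x66,
                                    0x00, 0xFA, 0xA0, 0x9A };
    return (uint8_t)(rS + adj[(op << 2) | (da1 << 1) | da0]);
}
```

As a quick check, bcd_addsub_model(0x19, 0x28, 0, 0) takes the 06 adjustment (the low nibble carried, the high nibble did not) and returns 0x47, while bcd_addsub_model(0x42, 0x17, 1, 1) takes the FA adjustment and returns 0x25. The model is only meant for valid BCD operands.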
Like you, I set up a dedicated test harness with which to synthesize the M65C02_BCD.v module in order to measure the theoretical performance: all module inputs are registered to eliminate the input delays from IOBs, and all module outputs are also registered to eliminate the output delays of the IOBs. In this manner, the reported performance measures the best theoretical performance that the synthesizer and PAR tools can achieve without any placement or period constraints.
For the M65C02_BCD module, using an XC3S200A-4 part, the Xilinx tools report that the best clock period that can be achieved is 7.287 ns. The resource utilization reported is 15 registers, 50 LUTs, and 33 slices. None of these registers, LUTs, or slices are part of the test harness in which I placed the module.
In essence, the second stage is implemented as a carry-lookahead adder. The equation that I derived for DA[1] is coupled to DA[0]. Without having derived the recurrence relation/equation for the decimal adjust flag of the next nibble, I would venture a guess that it will be a function of the DA flags of all the previous nibbles. This may explain why Intel did not include a flow-through BCD adjustment cycle in their 8080/8085 processors, and why they instead rely on a separate decimal adjust instruction (DAA) that uses the nibble (auxiliary) carry in the PSW. (The later 8086 has two such instructions, DAA and DAS, so the adjustment also depends on whether an addition or subtraction was performed.)
To address your conjecture regarding the simplification of BCD addition with increasing width, I think that the preceding explanation, which shows that the decimal adjustment flag for the next significant nibble is a function of the decimal adjustment of the preceding nibble, should put those thoughts to bed. Like high precision binary addition/subtraction, adjustment of higher order nibbles will be dependent on the rippling of the adjustment flags in a manner analogous to ripple carry for simple binary addition.
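To make the rippling concrete, here is a small C sketch of a 4-digit (16-bit) BCD addition that computes one adjust flag per nibble. It assumes that each flag follows the same pattern as the addition case of the DA[1] equation, chained from nibble to nibble; that recurrence is a guess for illustration, not something taken from the M65C02 source, and the name bcd_add16_flags is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 4-digit packed-BCD addition using per-nibble decimal adjust flags.
 * ASSUMPTION: flag i = C[i] | N[i]>9 | (DA[i-1] & ~C[i-1] & N[i]>8),
 * i.e. the DA[1] addition equation applied nibble after nibble.
 * Operands are expected to be valid BCD. */
uint16_t bcd_add16_flags(uint16_t a, uint16_t b, unsigned ci)
{
    assert(ci <= 1u);

    uint32_t sum = (uint32_t)a + b + ci;        /* binary partial sum */
    /* Bit 4*(i+1) of 'carries' is the binary carry out of nibble i. */
    uint32_t carries = sum ^ a ^ b;
    uint32_t adj = 0;
    unsigned prev_da = 0, prev_c = 0;

    for (int i = 0; i < 4; i++) {
        unsigned n = (sum >> (4 * i)) & 0xFu;
        unsigned c = (carries >> (4 * (i + 1))) & 1u;
        /* The +6 applied to the lower nibble carries into this one only
         * if that nibble was adjusted without a binary carry out. */
        unsigned ripple = prev_da & !prev_c;
        unsigned da = c | (n > 9) | (ripple & (n > 8));
        if (da)
            adj |= (uint32_t)6 << (4 * i);
        prev_da = da;
        prev_c  = c;
    }
    /* One full-width add applies every nibble adjustment at once. */
    return (uint16_t)(sum + adj);
}
```

With 0x0999 + 0x0001 the binary sum is 0x099A. Only the lowest nibble is greater than 9 on its own, but its adjustment pushes each 9 above it over the limit in turn, so the flags for the lower three nibbles are all set and the adjusted result is 0x1000. Each flag cannot be evaluated until the one below it is known, which is the ripple described above.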
In my investigations/studies of computer architectures of the 50s, 60s, and 70s, I've come across a couple of architectures that were BCD based. In some cases, I found that the ALUs in these computers operated in some kind of serial fashion. The Burroughs B2500/3500/4500 family strictly used BCD for both addresses and arithmetic. Interestingly, they supported BCD arithmetic with as many as 100 digits. They were able to do this because they built their ALU to operate in a nibble/digit serial fashion. Operating in this manner, the cascading of the carry/BCD adjustment is naturally pipelined, and does not require large and complex equations such as the ones I derived for my parallel BCD adder.
That last statement leads me to propose that if someone wishes to simulate the operation of a 16 or 32 bit BCD adder in C (or another language), then adopting a serial approach to the computation and adjustment of each nibble may be easier to implement, and ultimately faster, than a parallel addition approach with a bunch of nibble carry registers/variables. I think this may be a better approach for barrym95838 to use to simulate a 32-bit ADC/SBC operation.
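As a rough illustration of the serial approach, here is a minimal C sketch of a digit-at-a-time 32-bit (8 digit) BCD add/subtract. The function name and interface are made up for this example. It assumes valid BCD operands, handles subtraction by adding the nine's complement of each digit, and uses the 6502 convention that a carry of 1 going into or coming out of a subtraction means no borrow.

```c
#include <assert.h>
#include <stdint.h>

/* Digit-serial packed-BCD add/subtract on 8 digits (32 bits).
 * op = 0 adds, op = 1 subtracts. *carry is both carry-in and carry-out;
 * for subtraction, carry = 1 means "no borrow" (6502 convention).
 * ASSUMPTION: every nibble of a and b is a valid BCD digit (0..9). */
uint32_t bcd_addsub32_serial(uint32_t a, uint32_t b, unsigned op,
                             unsigned *carry)
{
    assert(op <= 1u && *carry <= 1u);

    uint32_t result = 0;
    unsigned c = *carry;

    for (int i = 0; i < 8; i++) {
        unsigned da = (a >> (4 * i)) & 0xFu;
        unsigned db = (b >> (4 * i)) & 0xFu;
        if (op)
            db = 9u - db;               /* nine's complement of the digit */
        unsigned s = da + db + c;       /* 0..19 for valid BCD digits     */
        c = s > 9u;                     /* decimal carry to next digit    */
        if (c)
            s -= 10u;
        result |= (uint32_t)s << (4 * i);
    }
    *carry = c;
    return result;
}
```

For example, with the carry variable set to 1, subtracting 0x00000001 from 0x10000000 returns 0x09999999 and leaves the carry at 1 (no borrow). Each digit needs only a compare and a conditional subtract, so no separate nibble carry or adjust flags have to be tracked.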