6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 22, 2024 3:39 am

All times are UTC




Post new topic Reply to topic  [ 22 posts ]  Go to page 1, 2  Next
Author Message
 Post subject: Fast BCD multiplication
PostPosted: Sat Nov 11, 2017 10:17 pm 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
I was curious about the fastest way to do 8 bit packed BCD multiplication, so I tried it a few different ways to get the average cycles required. I ran all combinations from $00 x $00 to $99 x $99 in a loop in the simulator. For each one, I subtracted the loop overhead time from the total time and divided by 10,000 to get average cycles per calculation. Here are the results:

1. Repeated additions: 1,596
2. Repeated additions, smaller multiplicand in counter: 1,236.66
3. Shift/add: 300.8
4. 4x8 bit lookup table: 107
5. 4x8 bit lookup tables aligned to $x000: 103
6. Russian peasant algorithm: 274.02
7. Quarter squares with first half of table: 115.33
8. Quarter squares with whole table: 118

Code is below. Please let me know what I can improve or if you see any mistakes.

Main loop
Code:
.START Main   ;start execution at Main
   
   ;Macros for Kowalski simulator (only needed for debug features)
   .INCLUDE "..\k_io.65s"
   
   ;4x8 bit lookup table
   .ORG $1000
   .INCLUDE "table1.65s"
   
   ;Quarter square lookup table
   .ORG $4000
   .INCLUDE "table3.65s"
   
   ;4x8 lookup table split into four for alignment
   .ORG $5000
   .INCLUDE "table5lo.65s"
   .ORG $6000
   .INCLUDE "table5hi.65s"
   .ORG $7000
   .INCLUDE "table6lo.65s"
   .ORG $8000
   .INCLUDE "table6hi.65s"

;Beginning of functions
   .ORG $F000
Main:
   
Arg1         .SET $10         ;First multiplicand
Arg2         .SET $11         ;Second multiplicand
ResultLo      .SET $12         ;16 bit result
ResultHi      .SET $13
R0         .SET $14         ;Generic counter/pointer variables
R1         .SET $15
R2         .SET $16
R3         .SET $17
   
   STZ Arg1               ;Multiplicand 1
   STZ Arg2               ;Multiplicand 2

MultiplyRoutine=1         ;Which of 8 rountines to use
Debug=0               ;Boolean: print multicands and answers?

   SED      
.multiplyloop               ;Loop from $0 x $0 through $99 x $99
   .IF Debug==1         ;If debug is enabled, print multiplicands
      LDA Arg2
      puthex
      LDA #'*'
      putc
      LDA Arg1
      puthex
      LDA #'='
      putc
   .ENDIF

   ;************************
   ;**Multiplier code here**
   ;************************


.multiplydone
   .IF Debug==1         ;If debug is enabled, print the answer
      LDA ResultHi
      puthex
      LDA ResultLo
      puthex
      putnl
   .ENDIF
   LDA Arg1               ;Increase multiplicand 1 by one
   CLC
   ADC #1
   STA Arg1
   BCS .carryset         ;If no carry, jump back to top of loop
      JMP .multiplyloop
.carryset               ;If carry is set, increment multiplicand 2
   LDA Arg2
   ADC #0
   STA Arg2
   BCS .loopdone         ;If multiplicand 1 is $99 and multiplicand 2 roles over, end
      JMP .multiplyloop
.loopdone
   
   BRK

I also added checks for 0 and 1 to skip calculating, but using either or both of those was slower on average for all 8 methods, so I left them out.
Code:
   .IF CheckForZero==1   ;If one multiplicand is zero, skip calculation
      LDA Arg1            ;Check multiplicand 1 for zero
      BEQ .zero         
         LDA Arg2         ;Check multiplicand 2 for zero
         BEQ .zero
            BRA .notzero;If neither is zero, continue
.zero
      .IF Debug==1      ;Print 'z' to show no calculation was done
         LDA #'z'
         putc
      .ENDIF
      STZ ResultLo      ;Store zero in result
      STZ ResultHi
      BRA .multiplydone   ;Skip to end of loop avoiding calculation
.notzero
   .ENDIF

   .IF CheckForOne==1   ;If one multiplicand is one, skip calculation
      LDA Arg1            ;Check multiplicand 1 for one
      CMP #1
      BEQ .Arg1one      ;If so, copy multiplicand 2 to result
         LDA Arg2         ;Check multiplicand 2 for one
         CMP #1
         BEQ .Arg2one   ;If so, copy multiplicand 1 to result
            BRA .notone;If neither is one, continue
.Arg1one
      .IF Debug==1
         LDA #'a'         ;Print 'a' to show no calculation was done
         putc
      .ENDIF
      LDA Arg2            ;Copy multiplicand 2 to result
      STA ResultLo
      STZ ResultHi
      BRA .multiplydone   ;Skip to end of loop avoiding calculation
.Arg2one
      .IF Debug==1
         LDA #'b'         ;Print 'a' to show no calculation was done
         putc
      .ENDIF
      LDA Arg1            ;Copy multiplicand 2 to result
      STA ResultLo
      STZ ResultHi
      BRA .multiplydone ;Skip to end of loop avoiding calculation
.notone
   .ENDIF

1. Repeated additions (1,596 cycles)
Code:
   .IF MultiplyRoutine==1
      STZ ResultLo       ;Zero results
      STZ ResultHi
      LDX Arg1            ;Hold multiplicand 1 in X to copy quickly
      LDA Arg2            ;Store multiplicand 2 in R0 counter
      STA R0            
.calcloop               ;Calculation loop
      BEQ .multiplydone   ;Loop as long as R0 counter is not 0
         TXA            ;Load multiplicand 1 copy stored in X
         CLC
         ADC ResultLo   ;Add to result
         STA ResultLo   
         BCC .calcnocarry   ;Skip updating high byte of result if no carry
            LDA #0         ;Load 0 since only carry is being added
            ADC ResultHi   ;Add carry
            STA ResultHi
.calcnocarry
         ;Decrease R0 counter by one
         LDA R0
         SBC #0
         STA R0
      BRA .calcloop
   .ENDIF

2. Repeated additions, smaller multiplicand in counter (1,236.66 cycles)
Code:
.IF MultiplyRoutine==2
      STZ ResultLo      ;Zero results
      STZ ResultHi
      
      LDA Arg1            ;Check which multiplicand is larger
      CMP Arg2
      BMI .switchargs
         LDX Arg1         ;Hold multiplicand 1 in X to copy quickly
         LDA Arg2         ;Store multiplicand 2 in R0 counter
         STA R0
         BRA .switchdone      
.switchargs
      LDX Arg2            ;Hold multiplicand 2 in X to copy quickly
      LDA Arg1            ;Store multiplicand 2 in R0 counter
      STA R0
.switchdone

;REST IS THE SAME AS Method 1 above
.calcloop               ;Calculation loop
      BEQ .multiplydone ;Loop as long as R0 counter is not 0
         TXA            ;Load multiplicand 1 copy stored in X
         CLC
         ADC ResultLo   ;Add to result
         STA ResultLo
         BCC .calcnocarry   ;Skip updating high byte of result if no carry
            LDA #0         ;Load 0 since only carry is being added
            ADC ResultHi   ;Add carry
            STA ResultHi
.calcnocarry
         ;Decrease R0 counter by one
         LDA R0
         SBC #0
         STA R0
      BRA .calcloop
   .ENDIF

3. Shift/add (300.8 cycles)
Code:
   .IF MultiplyRoutine==3
      STZ ResultLo      ;Zero results
      STZ ResultHi
      
      ;Start with ones place of multiplicand one
      LDA Arg1
      AND #$F
      TAX               ;Use X to countdown
.calcloop
      BEQ .ones_done      ;Loop until ones place of counter reaches 0
         ;Add multiplicand 2 to result
         LDA Arg2
         CLC
         ADC ResultLo
         STA ResultLo
         ;Carry to upper byte if necessary
         BCC .nocarry
            LDA #0
            ADC ResultHi
            STA ResultHi
.nocarry
      ;Decrease counter works on BCD since only ones place
      DEX
      BRA .calcloop
.ones_done
      
      ;Shift ones nibble of multiplicand 2 left and store in R1
      LDA Arg2
      ASL
      ASL
      ASL
      ASL
      STA R1
      
      ;Shift 10s nibble of multiplicand 2 to use as high byte
      LDA Arg2
      ROR
      ROR
      ROR
      ROR
      AND #$F
      STA R2
      
      ;Use 10s nibble of multiplicand 1 as counter in X
      LDA Arg1
      ROR
      ROR
      ROR
      ROR
      AND #$F
      TAX
      
      CLC
.loop10s
      BEQ .tens_done      ;Loop until X counts down to 0
         LDA R1         ;Load low byte of shifted muliplicand 2
         ADC ResultLo   ;Add to result
         STA ResultLo
         LDA R2
         ADC ResultHi   ;Load high byte of shifted multiplicand 2
         STA ResultHi
         DEX            ;Decrement 10s counter of multiplicand 1
      BRA .loop10s      
.tens_done
   .ENDIF

4. 4x8 bit lookup tables (107 cycles)
Tables are stored as $0xyy where x is a 4 bit BCD number and yy is an 8 bit packed BCD number. Each table has a high and low section for the high and low values of the 16 bit answer. Table one has unshifted answers where $0xyy = x * yy and table two is shifted answers where $0xyy = x * yy0. Note that BCD mode is enabled and disabled to do pointer arithmetic.
Code:
   .IF MultiplyRoutine==4
      LDA Arg1            ;Store multiplicand 1 in table pointer R0
      STA R0            
      LDA Arg2            ;Get low nibble of multiplicand 1
      AND #$F
      PHA               ;Make a copy of nibble for second half of table pointer
      CLC
      CLD               ;Disable BCD mode for pointer arithmetic
      ADC #>Table1_Lo   ;Add Table1 offset to high byte of table pointer
      STA R1
      LDA (R0)            ;Copy table entry to lowbyte of result
      STA ResultLo      
      PLA               ;Get copy of low nibble of multiplicand 2
      ADC #>Table1_Hi   ;Add Table1 offset to high byte of table pointer
      STA R1
      LDA (R0)            ;Copy table entry to lowbyte of result
      STA ResultHi
      LDA Arg2            ;Get high nibble of multiplicand 2
      ROR               
      ROR
      ROR
      ROR
      AND #$F
      PHA               ;Make a copy of nibble for second half of table pointer
      CLC                
      ADC #>Table2_Lo   ;Add Table1 offset to high byte of table pointer
      SED               ;Reenable BCD for adding table entry to result
      STA R1            ;High byte of table pointer
      LDA (R0)            ;Add table entry to low byte of result
      ADC ResultLo
      STA ResultLo
      LDA #0            ;Apply carry to high byte
      ADC ResultHi
      STA ResultHi
      PLA               ;Get copy of high nibble o multiplicand 2
      CLD               ;Disable BCD for pointer arithmetic
      ADC #>Table2_Hi   ;Add Table1 offset to high byte of table pointer
      SED               ;Reenable BCD for adding table entry to result
      STA R1            ;High byte of table pointer
      LDA (R0)            ;Add table entry to high byte of result
      ADC ResultHi
      STA ResultHi
   .ENDIF

5. 4x8 bit lookup tables aligned to $x000 (103 cycles)
Same as method five but the high and low parts of tables 1 and 2 are split up into four different files and each is aligned to $x000, so pointer arithmetic can be done in BCD mode.
Code: Same as method 4 without SED and CLD. Also, uses Table5 and Table 6 instead of Table1 and Table2.

6. Russian peasant algorithm (274.02 cycles)
Halving the smaller multiplicand is a little faster on average.
Code:
   .IF MultiplyRoutine==6
      ;Put larger of the two multiplicands in R0 to be doubled
      ;Put smaller multiplicand in R2 to be halved
      LDA Arg1      
      CMP Arg2            
      BMI .switchargs
         STA R0
         LDA Arg2
         STA R2
         BRA .dontswitch
.switchargs
         STA R2
         LDA Arg2
         STA R0
.dontswitch
      STZ R1               ;Clear high byte of 16 bit value for doubling
      STZ ResultLo         ;Clear results
      STZ ResultHi
      CLC
.calcloop   
      BBR #0,R2,.dontadd   ;Test halving variable for evenness
         ;Add 16 bit doubling variable (R0-R1) to result
         LDA R0            
         ADC ResultLo
         STA ResultLo
         LDA R1
         ADC ResultHi
         STA ResultHi
.dontadd                     ;Jump here if halving variable is even
      
      ;Double 16 bit doubling variable
      LDA R0
      ADC R0
      STA R0
      LDA R1
      ADC R1
      STA R1
      
      ;Halve the halving variable
      CLC
      ROR R2
      CLC
      BBR #3,R2,.dontcorrect   ;Check if BCD correction needed
         RMB #3,R2            ;Clear high bit of low nibble shifted in from high nibble
         LDA #5               ;Add 5 to low nibble instead
         ADC R2
         STA R2
.dontcorrect                  ;No BCD correction needed
      BEQ .calcloopdone
         JMP .calcloop
.calcloopdone
   .ENDIF

7. Quarter squares with first half of table (115.33 cycles)
I tried implementing the algorithm myself and ended up only using half the table that other algorithms I looked at later used. (This is not to save space, it just didn't occur to me to generate the second half since the first half is sufficient).
Code:
   .IF MultiplyRoutine==7      
      CLC               ;Store sum of multiplicand 1 and 2 in table pointer R0
      LDA Arg1                  
      ADC Arg2
      STA R0
      LDA #0            ;Store copy of carry in R2 for later
      ADC #0
      STA R2
      ADC #>Table3_Lo   ;Add offset into first table to high byte of table pointer
      STA R1
      LDA (R0)            ;Store table entry in result
      STA ResultLo
      LDA R2            ;Retrieve carry from adding multiplicands
      ADC #>Table3_Hi   ;Add offset into first table to high byte of table pointer
      STA R1
      LDA (R0)            ;Store table entry in result
      STA ResultHi
      SEC               
      LDA Arg1            ;Find difference of multiplicands
      SBC Arg2
      BCS .notinverse   ;Check if difference is negative
         STA R3         ;If negative, take absolute value
         SEC
         LDA #$00
         SBC R3
.notinverse               ;Jump here if difference is not negative
      STA R0            ;Store absolute value of difference of multiplicands in table pointer
      LDA #>Table3_Lo   ;High byte of pointer to table
      STA R1
      LDA (R0)            ;Retrieve table value
      STA R3            ;Save temporarily
      LDA ResultLo      ;Subract table value from result from earlier table access
      SEC
      SBC R3
      STA ResultLo
      BCS .nosubcarry   ;Borrow from high byte of result if subtraction requires
         LDA ResultHi
         SBC #0
         STA ResultHi
.nosubcarry
      LDA #>Table3_Hi   ;High byte of pointer into table
      STA R1
      LDA (R0)            ;Retrieve table value
      STA R3            ;Save temporarily
      LDA ResultHi      ;Subtract table value from high byte of result
      SEC
      SBC R3
      STA ResultHi
   .ENDIF

8. Quarter squares with whole table (118 cycles)
I tried to base this off the fast multiplication routine I found here:
http://codebase64.org/doku.php?id=base:seriously_fast_multiplication

In that version, the X register indexes into the table. I can't do the same thing by loading a BCD number into X, since it won't be added to the address in BCD format, so I calculate the address myself, like in method 7. Strangely, using the larger table is slightly slower as I have implemented it. I think it might be possible to improve this one. Any ideas?
Code:
   .IF MultiplyRoutine==8
      CLC                ;Store sum of multiplicand 1 and 2 in table pointer R0
      LDA Arg1
      ADC Arg2
      STA R0
      LDA #0            ;Store copy of carry in R2 for later
      ADC #0
      STA R2
      ADC #>Table3_Lo   ;Add offset into first table to high byte of table pointer
      STA R1
      LDA (R0)            ;Store table entry in result
      STA ResultLo
      LDA R2            ;Retrieve carry from adding multiplicands
      ADC #>Table3_Hi   ;Add offset into first table to high byte of table pointer
      STA R1
      LDA (R0)            ;Store table entry in result
      STA ResultHi
      
      ;Take inverse of multiplicand 1 before summing multiplicands
      ;(Same trick as EOR $#FF when calculation is binary)
      SEC               
      LDA #$99
      SBC Arg1
      CLC               ;Find sum of multiplicands
      ADC Arg2
      STA R0            ;Store in low byte of table pointer
      LDA #0            ;Save copy of carry
      ADC #0
      STA R2
      ADC #>Table4_Lo   ;Add carry and table address to high byte of table pointer
      STA R1            
      SEC
      LDA ResultLo      ;Subtract table entry from result from first table access
      SBC (R0)
      STA ResultLo
      LDA ResultHi      ;Borrow from high byte of result if necessary
      SBC #0
      STA ResultHi
      CLC               
      LDA R2            ;Retrieve carry from summing multiplicands
      ADC #>Table4_Hi   ;Add carry and table address to high byte of table pointer
      STA R1
      SEC               ;Subtract table entry from high byte of result
      LDA ResultHi
      SBC (R0)
      STA ResultHi
   .ENDIF


Attachments:
table6lo.asm [13.26 KiB]
Downloaded 163 times
table6hi.asm [13.85 KiB]
Downloaded 154 times
table5lo.asm [11.89 KiB]
Downloaded 158 times
table5hi.asm [12.47 KiB]
Downloaded 150 times
table3.asm [8.49 KiB]
Downloaded 163 times
table1.asm [51.47 KiB]
Downloaded 162 times
multtest.asm [13.41 KiB]
Downloaded 158 times
k_io.asm [753 Bytes]
Downloaded 159 times
Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 12, 2017 4:37 am 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
With small look up tables to replace 4-bit shifts and the halving of the Russian peasant algorithm, I shaved off 4-16 cycles from routines 3-6.

I tried converting the BCD multiplicands to decimal with a look up table so I could use the faster quarter square algorithm. It finishes in 62 cycles on average and needs 24 more (86 total) to convert decimal back to BCD with a table.

Code:
      ;Set up quarter square registers (at startup)
      LDA #>Table7_Lo
      STA QS1_Hi
      LDA #>Table7_Hi
      STA QS2_Hi
      LDA #>Table8_Lo
      STA QS3_Hi
      LDA #>Table8_Hi
      STA QS4_Hi

   .IF MultiplyRoutine==9
      LDX Arg1            ;Convert Arg1 from BCD to binary
      LDA HEXtoBIN,X
      TAY               ;Keep binary representation in Y
      LDX Arg2
      LDA HEXtoBIN,X      ;Convert Arg2 from BCD to binary
      STA QS1_Lo         ;Low byte of pointer into first table
      STA QS2_Lo         ;High byte of pointer into first table
      EOR #$FF            
      STA QS3_Lo         ;Low byte of pointer into second table
      STA QS4_Lo         ;High byte of pointer into second table
      
      CLD               ;Disable decimal mode for binary arithmetic
      SEC
      LDA (QS1_Lo),Y      ;Difference of low bytes from first and second table
      SBC (QS3_Lo),Y
      TAX               ;Store in X for indexing into binary to BCD table
      LDA (QS2_Lo),Y      ;Difference of high bytes from first and second table
      SBC (QS4_Lo),Y
      TAY               ;Store in Y for indexing into binary to BCD table
      
      ;Convert decimal to BCD
      SED
      CLC
      LDA Table9_Lo,X   ;Add low bytes of BCD look up table
      ADC Table10_Lo,Y   
      STA ResultLo      
      LDA Table9_Hi,X   ;Add high bytes of BCD look up table
      ADC Table10_Hi,Y
      STA ResultHi
      
   .ENDIF


Attachments:
converters.asm [4.82 KiB]
Downloaded 152 times
table7.asm [16.92 KiB]
Downloaded 147 times
Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 12, 2017 10:14 am 
Offline

Joined: Sat May 02, 2015 6:59 pm
Posts: 134
I started working on a test using the ūrdhva tiryagbhyāṃ Sūtra for multiplication, though ran into a 4bit shift issue of my own (too many).
I can probably expand the table to get around it.

Basically, the idea was to break the single 8x8 multiply into four 4x4 multiplies where the results can be fetched from a lookup table.
The table address consists of the high and low nyble to be multiplied with the data containing the result.

Code:

;     0    1    2    3    4    5    6    7    8    9  |---- Address  Padding ----|

.DB $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $FA, $FB, $FC, $FD, $FE, $FF   ; x 0
.DB $00, $01, $02, $03, $04, $05, $06, $07, $08, $09, $FA, $FB, $FC, $FD, $FE, $FF   ; x 1
.DB $00, $02, $04, $06, $08, $10, $12, $14, $16, $18, $FA, $FB, $FC, $FD, $FE, $FF   ; x 2
.DB $00, $03, $06, $09, $12, $15, $18, $21, $24, $27, $FA, $FB, $FC, $FD, $FE, $FF   ; x 3
.DB $00, $04, $08, $12, $16, $20, $24, $28, $32, $36, $FA, $FB, $FC, $FD, $FE, $FF   ; x 4
.DB $00, $05, $10, $15, $20, $25, $30, $35, $40, $45, $FA, $FB, $FC, $FD, $FE, $FF   ; x 5
.DB $00, $06, $12, $18, $24, $30, $36, $42, $48, $54, $FA, $FB, $FC, $FD, $FE, $FF   ; x 6
.DB $00, $07, $14, $21, $28, $35, $42, $49, $56, $63, $FA, $FB, $FC, $FD, $FE, $FF   ; x 7
.DB $00, $08, $16, $24, $32, $40, $48, $56, $64, $72, $FA, $FB, $FC, $FD, $FE, $FF   ; x 8
.DB $00, $09, $18, $27, $36, $45, $54, $63, $72, $81, $FA, $FB, $FC, $FD, $FE, $FF   ; x 9



I put together a couple of examples to help show the workings behind the math.
Attachment:
File comment: Urdhva Tiryagbhyam worked examples.
urdhva_tiryagbhyam.png
urdhva_tiryagbhyam.png [ 31.31 KiB | Viewed 5101 times ]


Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 12, 2017 2:56 pm 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Interesting results. We discussed some routines on Denial last year:

Multiplication in assembler

Larger tables translates into quicker routines.

Another way:
Code:
// y*a = ((y+a)^2-(y-a)^2)/4
STA $1
TYA
STA $2
SEC
SBC $1
TAX
TYA
CLC
// 3+2+3+2+3+2+3+2  +3+2+2+2+3+2+2+2 (or 5+3+3+3+3)
ADC $2
TAY
BCS trop
LDA squaretableLSB,Y    // All results pre-divided by 4, 0-255 table
SEC
SBC squaretableLSB,X    // All results pre-divided by 4, 0-255 table
STA resultLSB
LDA squaretableMSB,Y
SBC squaretableMSB,X
STA resultMSB
RTS
trop:
LDA squaretableLSB2,Y    // All results pre-divided by 4, 0-255 table
SEC
SBC squaretableLSB,X    // All results pre-divided by 4, 0-255 table
STA resultLSB
LDA squaretableMSB2,Y
SBC squaretableMSB,X
STA resultMSB
RTS


Clocks in at 53 cycles, but with a 512 byte table. Unceremoniously non-tested and non-optimized. For BCD, you only need a 200 byte table.

Edit 2:


Last edited by kakemoms on Fri Nov 17, 2017 1:28 pm, edited 3 times in total.

Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 12, 2017 3:42 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Quote:
In that version, the X register indexes into the table. I can't do the same thing by loading a BCD number into X, since it won't be added to the address in BCD format, so I calculate the address myself, like in method 7.

I haven't studied this carefully, but even if X holds a BCD number, surely it can still index into a table, it's just that the table is a bit sparsely populated - 100 entries in a field of 256.


Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 12, 2017 5:32 pm 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
Another way:
Cool. What are the lines before the table read doing?

Quote:
Clocks in at 53 cycles, but with a 512 byte table.
So are the squaretableLSB and squaretableMSB tables each 512 bytes?

Quote:
For BCD, you only need a 200 byte table.
Right, but I don't think you can use X indexing into the square table for BCD numbers (unless that is what you are handling in the first few lines). Have you tested this way with BCD numbers?

BigEd wrote:
Quote:
In that version, the X register indexes into the table. I can't do the same thing by loading a BCD number into X, since it won't be added to the address in BCD format, so I calculate the address myself, like in method 7.

I haven't studied this carefully, but even if X holds a BCD number, surely it can still index into a table, it's just that the table is a bit sparsely populated - 100 entries in a field of 256.

The table holds values for (a+b)^2/4, so any two numbers added together should point to the result of that calculation in the table like this:
Code:
LDA #>Table
STA PtrHi
LDA Arg1
STA PtrLo
LDX Arg2
LDA (PtrLo),X

So imagine if Arg1 is 30 and Arg2 is 40. The low byte of the pointer will be 70 and it will point to the table entry for (70)^2/4. It works the same if Arg1 is 35 and Arg2 is 35.
With BCD, though, if Arg1 is $30 and Arg2 $40, the pointer will not point to the same address as when Arg1 is $35 and Arg2 is $35.


Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 12, 2017 6:12 pm 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Druzyek wrote:
Quote:
Another way:
Cool. What are the lines before the table read doing?

Quote:
Clocks in at 53 cycles, but with a 512 byte table.
So are the squaretableLSB and squaretableMSB tables each 512 bytes?

Quote:
For BCD, you only need a 200 byte table.
Right, but I don't think you can use X indexing into the square table for BCD numbers (unless that is what you are handling in the first few lines). Have you tested this way with BCD numbers?


I think I edited my post while you were writing an answer, but anyway, corrected code and statements are shown above (in previous post).

The code before the table is to calculate (a+b)/2 and (a-b)/2 for BCD.

You need to correct the LSR for BCD (both a-b and a+b). The only problem with LSR is if there is something in bit 4 before the LSR (=10) in which it ends in bit 3 afterwards (=8) which is 3 more than it should be. So by testing bit 3 and subtracting 3 (if its there) in both a and b, the result becomes correct (see here for a better explanation)

The lookup table is simply to get the square value of a number. There are 256 possible inputs for a byte, so 256 bytes in the table LSB and 256 bytes in the table MSB. For BCD you only have 100 possible inputs, but the table needs to cover as much memory as for a full byte since you can't start filtering out $A,$B,$C,$D,$E and $F values in the first or last 4-bit container of the byte..

I don't show the table (which you either need to type in manually or calculate in assembly), but it will work with BCD values if you make it point to BCD values:

table,X --> if X=25 in BCD this is 2=%0010 for first container and 5=%0101 for second, so X=%00100101 which will point to the 37'th value in the table. As the table is precalculated, that entry will contain 25^2 or 625, which is %00000110 for MSB (=06) and %00100101 for LSB (=25). And so on..


Top
 Profile  
Reply with quote  
PostPosted: Sun Nov 12, 2017 7:28 pm 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
I think you could convert BCD to binary and divide by two with one table look up and use a 200 byte table like you mentioned.
On second thought, you have to add 1 if a and b are both odd, since halving them leaves two halves (which your code does I think).

Edit: Also, how to do you handle one odd and one even input? Don't you need to subtract one, then add it back after the table look up? I was trying things out on paper and couldn't get it to work out for one even and one odd, then I found this: http://www.trottermath.net/algebra/multsqs.html

Edit 2: It looks like you can do it taking even and odd into account too: http://codebase64.org/doku.php?id=base: ... iplication


Top
 Profile  
Reply with quote  
PostPosted: Mon Nov 13, 2017 12:16 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Oops, you're right, I do see now that the table indexing wouldn't work: $09+$09 is $12 in binary, but $18 in BCD, but you can also reach $12 with $10+$2 (in either sense) - so the table entry cannot correspond to a unique sum of BCD inputs.


Top
 Profile  
Reply with quote  
PostPosted: Mon Nov 13, 2017 8:48 pm 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
BigEd wrote:
Oops, you're right, I do see now that the table indexing wouldn't work: $09+$09 is $12 in binary, but $18 in BCD, but you can also reach $12 with $10+$2 (in either sense) - so the table entry cannot correspond to a unique sum of BCD inputs.
I haven't had time to try it yet, but the fastest version might be to convert BCD to decimal with a lookup table and use that to index into a table of BCD values.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 15, 2017 2:35 am 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
I haven't had time to try it yet, but the fastest version might be to convert BCD to decimal with a lookup table and use that to index into a table of BCD values.

This does seem to be the fastest method. I got it down to 60 cycles:
Code:
;Method 10 - Quarter square with decimal index and BCD data
   .IF MultiplyRoutine==10
      LDX Arg1            ;Convert Arg1 from BCD to binary
      LDA HEXtoBIN,X
      TAY               ;Keep binary representation in Y
      LDX Arg2
      LDA HEXtoBIN,X      ;Convert Arg2 from BCD to binary
      STA QS1_Lo         ;Low byte of pointer into first table
      STA QS2_Lo         ;High byte of pointer into first table
      EOR #$FF            
      STA QS3_Lo         ;Low byte of pointer into second table
      STA QS4_Lo         ;High byte of pointer into second table
      
      SEC
      LDA (QS1_Lo),Y      ;Difference of low bytes from first and second table
      SBC (QS3_Lo),Y
      STA ResultLo
      LDA (QS2_Lo),Y      ;Difference of high bytes from first and second table
      SBC (QS4_Lo),Y
      STA ResultHi
   .ENDIF


Quote:
Clocks in at 53 cycles, but with a 512 byte table. Unceremoniously non-tested and non-optimized. For BCD, you only need a 200 byte table.
I still don't think that what you are showing is possible without checking if the sum is even or odd, making the sum even, then adding to the sum after the table fetch it the sum was odd. I would be glad to see how you do it though if you can. Here is my implementation of that process for BCD which takes 113 cycles on average:

Code:
   .IF MultiplyRoutine==11
      LDA Arg1
      BEQ .zero            ;If a+b is odd, we need to subtract one, so make sure
                           ;at least one multiplicand is above 0
         CMP Arg2            
         BCC .swap         ;Make sure larger argument ends up in Y, and smaller in X
            TAY
            STA R2         ;Save R2 for correction at end if a+b was odd
            LDX Arg2         
            BEQ .zero
            BRA .swapdone      
.swap
         TAX
         LDY Arg2
         STY R2
         BEQ .zero
.swapdone
            LDA HEXtoBIN,Y   ;Convert larger multiplicand from BCD to dec
            TAY
            STA R1         ;Store in R1 for EORing with other multiplicand
            LDA HEXtoBIN,X   ;Convert smaller multiplicand from BCD to dec
            STA R0         ;Save fr calculating a+b and a-b
            EOR R1         ;EOR a and b to see if a+b is even or odd
            ROR            ;Shift bit 0 into Carry for testing
            TYA            ;Load larger multiplicand in preparation for a+b
            BCS .odd         ;If odd, calculate differently
               CLD         ;Calculate a+b
               CLC
               ADC R0   
               ROR         ;Divide by 2
               TAX         ;Store in X for indexing into table
               TYA         ;Larger multiplicand
               SEC
               SBC R0      ;Calculate a-b
               CLC
               ROR         ;Divide by 2
               TAY         ;Store in Y for indexing into table
               SED         ;Diference of low bytes from square table
               SEC
               LDA Table13_Lo,X
               SBC Table13_Lo,Y
               STA ResultLo
               LDA Table13_Hi,X   ;Difference of high bytes in square table
               SBC Table13_Hi,Y
               STA ResultHi
               BRA .multiplydone
.odd
               DEC R0      ;Decrease smaller multiplicand to make a+b even
               ;Next 19 lines same as above
               CLD         ;Calculate a+b
               CLC
               ADC R0
               ROR         ;Divide by 2
               TAX         ;Store in X for indexing into table
               TYA         ;Larger multiplicand
               SEC
               SBC R0      ;Calculate a-b
               CLC
               ROR         ;Divide by 2
               TAY         ;Store in Y for indexing into table
               SED         ;Diference of low bytes from square table
               SEC
               LDA Table13_Lo,X
               SBC Table13_Lo,Y
               STA ResultLo
               LDA Table13_Hi,X   ;Difference of high bytes in square table
               SBC Table13_Hi,Y
               STA ResultHi
               CLC               ;Add back copy of larger multiplicand
               LDA R2
               ADC ResultLo
               STA ResultLo
               LDA #0
               ADC ResultHi
               STA ResultHi
      BRA .multiplydone
.zero
      STZ ResultLo         ;One multplicand was zero
      STZ ResultHi
   .ENDIF


I was also thinking the absolute fastest way would be to map three buffers into zero page that drive a 128k ROM and one address to read the output. That would be 32 cycles like this:
Code:
   LDA Arg1
   STA ROM_PtrLo
   LDA Arg2
   STA ROM_PtrHigh
   STZ ROM_Op
   LDA ROM_Out
   STA ResultLo
   INC ROM_Op
   LDA ROM_Out
   STA ResultHi


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 15, 2017 6:40 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Druzyek wrote:
BigEd wrote:
Oops, you're right, I do see now that the table indexing wouldn't work: $09+$09 is $12 in binary, but $18 in BCD, but you can also reach $12 with $10+$2 (in either sense) - so the table entry cannot correspond to a unique sum of BCD inputs.
I haven't had time to try it yet, but the fastest version might be to convert BCD to decimal with a lookup table and use that to index into a table of BCD values.

I wonder, would excess-3 notation be a better bet for this? Add $33 to both operands, and then the sum is $33 greater than the BCD sum - it all becomes very linear.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 15, 2017 8:11 am 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
I wonder, would excess-3 notation be a better bet for this?
Hmm, I've never heard of this. The Wikipedia page says it was used in some calculators. So, I made a spreadsheet with $33 added to both operands but it still looks like there is overlap. Wouldn't adding $33 to both operands make it $66 greater than the BCD sum? Do you mean to add it as BCD or just a binary add? I tried all the combinations and there is still overlap.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 15, 2017 8:43 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Hmm, I thought it worked out neatly, but trying it out I seem to be confused.


Top
 Profile  
Reply with quote  
PostPosted: Thu Nov 16, 2017 1:21 pm 
Offline

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Druzyek wrote:
Quote:
I think you could convert BCD to binary and divide by two with one table look up and use a 200 byte table like you mentioned.
On second thought, you have to add 1 if a and b are both odd, since halving them leaves two halves (which your code does I think).

Edit: Also, how to do you handle one odd and one even input? Don't you need to subtract one, then add it back after the table look up? I was trying things out on paper and couldn't get it to work out for one even and one odd, then I found this: http://www.trottermath.net/algebra/multsqs.html

Edit 2: It looks like you can do it taking even and odd into account too: http://codebase64.org/doku.php?id=base: ... iplication


I guess you are right there. We lose 0.5 in instances were one of the original numbers were odd, so doing a LSR is not going to work.

In fact, one should skip the division by two of the original numbers and double the array (to 512 words) to get a correct lookup table (and take that into account that you are retrieving a value 4 times what you intended). After subtraction you simply divide by 4.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 22 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 4 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: