Well I've been doing embedded systems development of one kind or another for 35+ years, and this is the first division routine that I've ever actually written. I've recently implemented a Goldschmidt square root and inverse square function, and used the square of the inverse square root to perform division by multiplication with the inverse of the divisor. Since the Goldschmidt algorithm uses multipliers to iteratively arrive at its results, I don't really consider that an implementation of a division algorithm per se.
Therefore, the following source contains my first SW division routine. I've written it to be part of the run-time library of my Pascal compiler, so it finds its operands on the system stack. The stack-relative addressing mode and the register stack has been put to use in constructing the routine. The return address is at offset 1, the divisor is at offset 3, and the dividend is at offset 5. When finished, the ATOS register holds the quotient, and the ANOS register holds the remainder. At initialization, ATOS holds the dividend, and ANOS is initialized to 0. A non-restoring algorithm is used for the implementation. The file has been assembled with the PyAsm65 assembler, and I loaded and tested the routine in Py65.
Code:
( 1) ; ; unsigned division 16 x 16
( 2) ; .cod 512
( 3) ; _idiv .proc
( 4) 0200 A900 ; lda #0
( 5) 0202 0B ; dup a
( 6) 0203 CBB505 ; lda.w 5,S
( 7) 0206 A010 ; ldy #16
( 8) ; _idiv_Lp
( 9) 0208 18 ; clc
( 10) 0209 AB0A ; asl.w a
( 11) 020B 1B ; swp a
( 12) 020C AB2A ; rol.w a
( 13) 020E B006 ; bcs _idiv_Plus
( 14) ; _idiv_Minus
( 15) 0210 38 ; sec
( 16) 0211 CBF503 ; sbc.w 3,S
( 17) 0214 8004 ; bra _idiv_Next
( 18) ; _idiv_Plus
( 19) 0216 18 ; clc
( 20) 0217 CB7503 ; adc.w 3,S
( 21) ; _idiv_Next
( 22) 021A 1B ; swp a
( 23) 021B 3002 ; bmi _idiv_Dec
( 24) 021D AB1A ; inc.w a
( 25) ; _idiv_Dec
( 26) 021F 88 ; dey
( 27) 0220 D0E6 ; bne _idiv_Lp
( 28) ; ;
( 29) ; _idiv_Exit
( 30) 0222 1B ; swp a
( 31) 0223 AB090000 ; ora.w #0
( 32) 0227 1004 ; bpl _idiv_Finish
( 33) 0229 18 ; clc
( 34) 022A CB7503 ; adc.w 3,S
( 35) ; _idiv_Finish
( 36) 022D 1B ; swp a
( 37) 022E 60 ; rts
( 38) ; .endp
( 39) ; .end
For 32767 / 10 = 3276 rem 7, the routine requires 378 cycles. For 32767 / 0 = 65535 rem 32767, the routine requires 410 cycles. For 1 / 1 = 1 rem 0, the routine require 350 cycles. I've not exhaustively tested the routine, but it looks like the execution time ranges from 410 to 350 cycles. If the architecture supported register to register addition/subtraction, some more cycles could be saved. As it is, with the quotient and remainder maintained in the A register stack, a considerable number of memory cycles were saved. For the routine, the average CPI is 1.88 cycles, and the average instruction length is 1.77 bytes. I found these results interesting given that the stack-relative addition and subtraction instructions are 3 bytes long and require an additional 2 cycles to read the divisor from the stack: a total of 5 cycles.
Would like to gauge my implementation with my M65C02A architecture against a similar division routine implemented on the 65816. Since the '816 has a more 16-bit like architecture and does not rely on prefix codes like the M65C02A architecture, I suspect that there will be some advantage to the '816 over the M65C02A. Maybe the register stack that is part of the M65C02A A, X, and Y registers may offer some advantage to close the performance gap.