I've been scheming up a project for a while now for a small portable computer (think Tandy Model 100) based around the 65C816 and incorporating 64KB of MRAM I got as a free sample (thank you, Everspin!) Since it's not necessarily going to incorporate any other form of on-board storage (I might work in some flash memory I've got on hand as a kind of notional hard disk, but we'll see - I'm trying to keep it simple,) I'd like to keep code size to a minimum so as to leave more room for whatever data I'm working on. To that end, I've done a 16-bit adaptation of a bytecode interpreter I'm using in another project, which I think should give pretty good code density by comparison to 16-bit '816 assembler (which is already pretty good, but I'd like to squeeze it down even further.) I figure I'll start a thread about it, partly to show off the fruit of my labors and partly to see if anybody can help me improve it. (Full code dump will come a little later, when I've gotten it into a satisfactory state of feature-completeness.)
In particular, at the moment I've just finished the division routine. Overall, I'm pretty happy with it, but it seems a little big to me, and I'm curious if anybody has any thoughts on cutting it down a bit. Here's the code as it stands:
Code:
divide: ; Common division code - divides #1 by #0 and gives the
dex ; quotient in #1 and the remainder in #0 (signed)
dex ; Decrement the data stack pointer for a work location
phy ; Save the instruction pointer to a temporary location
stz <$00,x ; Clear the sign-correct count
ldy <$00,x ; Clear the pre-shift count in Y
lda <$02,x ; Get the divisor from #0
beq .z ; If it's zero, implode
clc ; Clear the carry for the upcoming shift
bpl .pa ; If it's positive, don't negate it
eor #$ffff ; Otherwise, negate it
inc
sta <$02,x ; And store it back
sec ; Set the carry for the upcoming shift
.pa rol <$00,x ; Update the sign-correct flags
lda <$04,x ; Get the dividend from #1
bpl .pb ; If it's positive, don't negate it
eor #$ffff ; Otherwise, do that
inc
sec ; Set the carry (already cleared) for the upcoming shift
.pb rol <$00,x ; Update the sign-correct flags
stz <$04,x ; Clear #1 for use as the quotient
.ps bit <$02,x ; Check the MSB on the divisor
bmi .sd ; If it's 1, don't pre-shift any further
asl <$02,x ; Otherwise, shift it left one place
iny ; And increment our pre-shift count
bra .ps ; And see if we're done with the pre-shift
.sd ; The divisor is now pre-shifted as far left as it will go - begin!
; A contains the working remainder
cmp <$02,x ; Is the shifted divisor larger than the remainder?
bcc .ns ; If so, don't subtract it from the remainder
; Otherwise, do so
php ; Save the flag state first
sec
sbc <$02,x
plp ; And restore it
.ns rol <$04,x ; Whatever we do, shift the carry into the quotient
lsr <$02,x ; And shift the divisor right one place
dey ; And decrement the pre-shift count
bpl .sd ; If it hasn't underflowed, continue
sta <$02,x ; Store the remainder into #0
ply ; Restore the instruction pointer
lda <$04,x ; Load the quotient into A for sign-correction
lsr <$00,x ; Check - was the dividend negative?
bcc .cd ; If not, skip this bit
eor #$ffff ; Otherwise, negate the quotient
inc
pha ; And save it to a temporary location
lda <$02,x ; And get the remainder
eor #$ffff ; And negate it
inc
sta <$02,x ; And store it back
pla ; And retrieve the quotient
.cd lsr <$00,x ; Check - was the divisor negative?
bcc .d ; If not, skip this bit
eor #$ffff ; Otherwise, negate the quotient
inc
.d sta <$04,x ; And store the quotient back
inx ; Increment the data stack pointer (drop work location)
inx
rts ; Return to the calling routine
A few words of explanation: as you may be able to tell, the bytecode VM is stack-based, so the operands are addressed as DP,X with X being the stack pointer. I'm trying to do all work on the stack and avoid hard-coded temporary locations, for tidiness. The routine itself is binary long division on signed 16-bit operands, and the routine is careful to do the right thing with the quotient and remainder depending on whether the dividend and/or divisor are negative. (Specifically, that if the dividend is negative then the remainder is negative, so that quotient * divisor + remainder always equals the dividend.)
As far as trying to optimize it further, my priorities are:
- First, that it's as small as possible while still being decently performant.
- Second, that it's as performant as possible while still being small.
I want to avoid tables entirely (in order to keep it small,) and I want to preserve the end-result behavior (signed 16/16=16 division giving the quotient and remainder and handling negative operands correctly.) Beyond that, I'm open to input: can anybody improve on this?