Actually, subtract-10^N-and-count is generally faster than divide-by-10 (as is usually done in Forth), in many cases a lot faster. The 10^N method will give you fairly uneven timing, though (consider displaying 0 vs. 99), which may not be ideal for game applications. You can speed up the 10^N method (and make the timing more even) by subtracting 2^K * 10^N (K=0 to 3), largest multiple first, so that each digit is built one bit at a time with a fixed number of compares. Here's an (untested) example for an 8-bit number (in the accumulator):
Code:
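      LDY #9           ; Y indexes T1, starting at the largest weight (200)
L1    LDX #0
      STX DIGIT        ; clear the digit being built
      LDX T2,Y         ; X = number of times to loop for this digit
L2    CMP T1,Y         ; does this weight fit in what's left?
      BCC L3           ; no: carry is clear, so a 0 bit gets shifted in
      SBC T1,Y         ; yes: subtract it (carry stays set, so a 1 bit gets shifted in)
L3    ROL DIGIT        ; shift the carry into the digit
      DEY              ; next smaller weight
      DEX
      BNE L2
      JSR OUTPUT_DIGIT ; assumed to preserve A and Y
      CPY #0           ; CPY, unlike TYA, won't overwrite the accumulator
      BPL L1           ; done once Y has wrapped to $FF
      RTS
T1    DB 1,2,4,8,10,20,40,80,100,200 ; the weights 2^K * 10^N
T2    DB 0,0,0,4,0,0,0,4,0,2         ; bits per digit, read only at Y = 9, 7, 3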
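Since the routine above is untested, here is a rough Python model of the same idea that can be checked on a host machine. It's only a sketch: the names WEIGHTS and to_decimal_digits are made up for illustration, and it mirrors the logic of the loop rather than the exact register usage.

```python
# Weights mirror T1 above, grouped per digit, largest first: 2^K * 10^N.
WEIGHTS = [(200, 100), (80, 40, 20, 10), (8, 4, 2, 1)]


def to_decimal_digits(value):
    """Model of the subtract-2^K*10^N conversion for one byte (0..255)."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in 8 bits")
    digits = []
    for group in WEIGHTS:
        digit = 0
        for weight in group:
            digit <<= 1              # ROL DIGIT: make room for the next bit
            if value >= weight:      # CMP T1,Y / BCC L3
                value -= weight      # SBC T1,Y
                digit |= 1           # the set carry becomes a 1 bit
        digits.append(digit)
    return digits


print(to_decimal_digits(7))      # [0, 0, 7]
print(to_decimal_digits(199))    # [1, 9, 9]
```

Every input takes the same ten compare steps, which is where the more even timing comes from.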
Note that the above will display leading zeros. There are several ways it can be optimized for a specific situation. Another possibility is to use unpacked BCD, which is just one decimal digit per byte. You can then do a two-byte (i.e. two-digit) A = A+B addition (say, for scorekeeping) with:
Code:
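      CLC
      LDA A0
      ADC B0      ; add the low digits
      CMP #10     ; carry set if the sum is 10 or more
      BCC L1
      SBC #10     ; wrap the digit; carry stays set for the next digit
L1    STA A0
      LDA A1
      ADC B1      ; add the high digits plus the carry
      CMP #10
      BCC L2
      SBC #10
L2    STA A1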
or an A = A+1 (say, for timekeeping) with:
Code:
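      INC A0
      LDA #10
      EOR A0      ; A = 0 only if the digit has reached 10
      BNE L1
      STA A0      ; reset the digit to 0...
      INC A1      ; ...and carry into the next digit
      LDA #10
      EOR A1
      BNE L1
      STA A1
L1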
and then eliminate the conversion entirely. The trade-off is that unpacked BCD is less memory efficient.
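To make the unpacked BCD idea concrete, here's a small Python model of the two snippets above, generalized to any number of digits. It's a sketch only; bcd_add and bcd_increment are names I've made up, and the digits are stored low digit first, like A0, A1 above.

```python
def bcd_add(a, b):
    """a = a + b for unpacked BCD digit lists (low digit first); returns the carry out."""
    carry = 0
    for i in range(len(a)):
        total = a[i] + b[i] + carry
        carry = 1 if total >= 10 else 0    # CMP #10 sets the carry when total >= 10
        a[i] = total - 10 if carry else total
    return carry


def bcd_increment(a):
    """a = a + 1 for an unpacked BCD digit list (low digit first)."""
    for i in range(len(a)):
        a[i] += 1
        if a[i] != 10:
            return          # no carry into the next digit, so we're done
        a[i] = 0            # digit wrapped; carry on to the next one


score = [7, 4]              # 47
bcd_add(score, [5, 3])      # + 35
print(score)                # [2, 8], i.e. 82
```

Each byte is already a digit, so displaying the number is just a matter of outputting the bytes from high to low.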
There are various techniques for speeding up division, but which one to use depends on your needs, e.g. whether you need both a quotient and a remainder or just one of the two, whether you need an exact quotient or an approximate quotient will suffice, the range of input and output values you'll be using, etc.
One thing to consider is how much faster you need it to be. If you only need, say, a 5 or 10% speed increase, there are probably ways to optimize the routine you're already using to squeeze out enough cycles. On the other hand, if you need it to be 2 or 3 times as fast, then you'll probably have to consider other options.
Another thing to look at is the values that will be passed to the division routine and the values that will be returned. For example, a general-purpose 24-bit shift-and-subtract division routine can be sped up if you know that the quotient will always be <256.
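As a rough illustration of that last point, here's a Python model of a shift-and-subtract division that takes advantage of a quotient known to be <256. I'm assuming a 24-bit dividend and a 16-bit divisor, and div24_small_quotient is a made-up name. Because the top 16 bits of the dividend must be smaller than the divisor, they can be loaded straight into the remainder, and only the low 8 bits need to be shifted through.

```python
def div24_small_quotient(dividend, divisor):
    """Return (quotient, remainder), assuming the quotient fits in 8 bits."""
    if not 0 < divisor < (1 << 16) or not 0 <= dividend < (1 << 24):
        raise ValueError("operands out of range")
    remainder = dividend >> 8            # preload the top 16 bits
    if remainder >= divisor:
        raise ValueError("quotient would not fit in 8 bits")
    low = dividend & 0xFF                # only these 8 bits remain to be shifted in
    quotient = 0
    for _ in range(8):                   # 8 iterations instead of 24
        remainder = (remainder << 1) | (low >> 7)
        low = (low << 1) & 0xFF
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


print(div24_small_quotient(1000000, 5000))   # (200, 0)
```

The general-purpose routine would spend its first 16 iterations producing quotient bits that are always zero, so skipping them costs nothing.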
It may sound silly, but the best way to speed up division is not to do any. You haven't mentioned what you're using division for, but it may be worthwhile to re-think the problem to see if the division is really necessary. In many situations it's possible to eliminate division entirely by maintaining an extra variable or two as you go along.
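Here's one made-up example of the "extra variable" approach, sketched in Python: a cursor moving through a 40-column screen buffer. Rather than computing row = offset / 40 and column = offset mod 40 each time, the row and column are simply kept up to date as the offset advances. The 40-column width and the Cursor class are assumptions for illustration only.

```python
SCREEN_WIDTH = 40   # hypothetical screen width, for illustration


class Cursor:
    """Tracks offset, row and column together so no division is ever needed."""

    def __init__(self):
        self.offset = 0
        self.row = 0
        self.col = 0

    def advance(self):
        self.offset += 1
        self.col += 1
        if self.col == SCREEN_WIDTH:    # wrapped past the right edge
            self.col = 0
            self.row += 1


cursor = Cursor()
for _ in range(85):
    cursor.advance()
print(cursor.row, cursor.col)           # 2 5
```

On a 6502 the equivalent is an INC, a CMP, and an occasional second INC, which is far cheaper than any division routine.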
Another possibility -- if you only need a quotient, and you're dividing by the same number frequently -- is to use a routine (or table) intended for a specific denominator. The divide-by-7 routine from the 12/84 Apple Assembly Line (see the link at the top of the Source Code Repository page) is one example of this type of routine. Basically, it's the division counterpart to the *10 routines that do a *2 + *8 or (*1 + *4) *2 calculation, such as:
Code:
      STA TEMP    ; save the original value
      ASL         ; A*2
      ASL         ; A*4
      CLC
      ADC TEMP    ; A*4 + A = A*5
      ASL         ; A = A*10 (fits in 8 bits only if the original A <= 25)
As with its multiplication counterpart above, you must rewrite (or regenerate) this type of division routine when the divisor (denominator) changes. You can write a program to generate an efficient division routine for a particular divisor (I once wrote such a program), and then simply generate a new division routine for a new divisor. However, if you're changing divisors frequently, the time it takes to generate a new routine will likely eat up any time saved by the division routine. In other words, the speed increase comes from using the generator infrequently, and using the division routine frequently.
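To give a flavor of what such a generator might do, here's a Python sketch. It is not the Apple Assembly Line routine and not the program mentioned above; it uses the common multiply-by-a-scaled-reciprocal approach, and find_reciprocal and shift_add_terms are made-up names. For a given divisor it searches for m and k such that (n * m) >> k equals n / divisor for every 8-bit n, then lists the set bits of m, each of which would become one shift-and-add step in the generated routine.

```python
def find_reciprocal(divisor, max_n=255):
    """Find (m, k) with (n * m) >> k == n // divisor for all 0 <= n <= max_n."""
    if divisor < 1:
        raise ValueError("divisor must be positive")
    for k in range(32):
        m = -(-(1 << k) // divisor)      # ceil(2^k / divisor)
        if all((n * m) >> k == n // divisor for n in range(max_n + 1)):
            return m, k
    raise ValueError("no multiplier found")


def shift_add_terms(m):
    """Bit positions set in m; each one is a shifted copy of n to add."""
    return [bit for bit in range(m.bit_length()) if (m >> bit) & 1]


m, k = find_reciprocal(10)
print(m, k, shift_add_terms(m))
```

The search itself is slow (it checks every input exhaustively), which fits the point above: run the generator rarely, and use the routine it produces often.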
An example of a long shot would be to investigate a technique like a/b = N ^ (logN(a) - logN(b)). Usually, the N^x and logN(x) calculations are much slower than division, but if you can think up some clever way of doing (or approximating) them with tables, you've reduced division to a subtraction and a couple of table lookups.
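Just to show the shape of the table idea, here's a Python sketch with N = 2 and 8-bit operands. The table resolution (SCALE = 32) is an arbitrary assumption, approx_div is a made-up name, and the result is only approximate (within a few percent of a/b, plus rounding), so it only makes sense where an approximate quotient will do.

```python
import math

SCALE = 32   # table resolution (an arbitrary choice): entries hold SCALE * log2(x)

# LOG_TABLE[x] ~ SCALE * log2(x) for x = 1..255 (index 0 is unused)
LOG_TABLE = [0] + [round(SCALE * math.log2(x)) for x in range(1, 256)]
# EXP_TABLE[i] ~ 2 ** (i / SCALE), covering every possible non-negative difference
EXP_TABLE = [round(2 ** (i / SCALE)) for i in range(LOG_TABLE[255] + 1)]


def approx_div(a, b):
    """Approximate a / b for 1 <= a, b <= 255 using lookups and one subtraction."""
    diff = LOG_TABLE[a] - LOG_TABLE[b]
    if diff < 0:
        return 0                 # a < b, so the quotient is below 1
    return EXP_TABLE[diff]


print(approx_div(200, 10), approx_div(255, 3))
```

The two tables cost about 512 bytes here, and a finer SCALE buys accuracy at the price of a bigger antilog table.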