BigDumbDinosaur wrote:
It [MVN] can fill upwards of 50 KB in about 35 milliseconds with a 10 MHz Ø2 clock—several times faster than possible by using STZ in a loop.
Since MVN and MVP move only 1 byte every 7 cycles, a fill loop built on a 16-bit store only needs to take fewer than 14 cycles per iteration, not fewer than 7, to be faster per byte. The analysis here holds for STZ too:
http://6502org.wikidot.com/software-65816-blockfill
BigDumbDinosaur wrote:
The above would be even faster if it were always clearing an even number of bytes, as .A could be set to 16 bits and thus clear two bytes at a time with only a one cycle penalty, plus an extra DEY to step the counter by twos.
Why not divide the length by 2 (i.e. shift right) outside the loop and use a single DEY, since the common case is that the loop is executed multiple times? In fact, the farther you unroll the loop, the closer you can get to 2 cycles per byte, e.g.:
Code:
; S = end address (last byte of the region), A = length in bytes, Y = fill value
; (switched around from the above)
; m and x flags assumed to be 0 (16-bit accumulator and index registers)
; length assumed to be a nonzero multiple of 8
;
; untested
;
        lsr             ; divide length by 8...
        lsr
        lsr
.1      phy             ;4 ...and fill 8 bytes at a time
        phy             ;4
        phy             ;4
        phy             ;4
        dec             ;2
        bne .1          ;3 (21 cycles per 8 bytes, about 2.6 cycles per byte)
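Since the loop borrows S as the fill pointer, the caller has to save and restore the real stack pointer around it. Below is a rough sketch of one way to wrap it, equally untested. The entry convention (end address in X), the label fill8, and the two 16-bit scratch locations len and saved_s are all made up for illustration. The region has to be in bank 0, because that is the only place the stack can point.

```
; Hypothetical wrapper, untested.
; Entry: native mode, m = 0 and x = 0
;        A = length in bytes (nonzero multiple of 8)
;        X = address of the last byte of the region (bank 0)
;        Y = 16-bit fill value
; len and saved_s are assumed 16-bit scratch locations.
;
fill8   php             ; save flags (including I) on the real stack
        sei             ; mask IRQs while S points into the fill area
        sta len         ; park the byte count
        tsc             ; A = real stack pointer
        sta saved_s
        txs             ; S = last byte of the region
        lda len
        lsr             ; byte count / 8 = loop count
        lsr
        lsr
.1      phy             ; 8 bytes per pass, as above
        phy
        phy
        phy
        dec
        bne .1
        lda saved_s
        tcs             ; back to the real stack
        plp             ; restore flags, re-enabling IRQs if they were on
        rts
```

SEI does nothing about NMI, so if an NMI can arrive during the fill, its pushes will land in or just below the region being filled.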
There's also the PEI bank 0 memory move trick, which is twice as fast as MVN/MVP (and could be used for filling, though it's slower than PHA or PHY). It is described here:
http://6502org.wikidot.com/software-65816-speed