First, as I mentioned, I'm not too familiar with the 65816, so feel free to correct any misperceptions I have about its addressing modes or whatever.
BigDumbDinosaur wrote:
cjs wrote:
Hm! So that sounds like yet another reason to move the direct page to cover your I/O addresses when loading data, and then run the load routine in the bank you're loading. That would fix this problem, would it not?
...While accessing an I/O register as a direct page location does eliminate one clock cycle per access—assuming DP points at a page boundary—...
Um, I think two cycles per access in the situation we were talking about, right? I was responding to your earlier comment:
BigDumbDinosaur wrote:
Something to be aware of is reading data from a fixed address with a 65C816, as would be the case with disk I/O, will involve long indirection if the data is going into or coming out of a different bank than the one in which the I/O device is located. Indirection of any kind costs clock cycles because it involves additional internal steps in the MPU. Any 24-bit load or store will incur a one clock cycle penalty for each access.
What I understood you to be saying here is that to load data from the address used for input from your device, say, $C012, you expected one would be using absolute long
LDA $00C012, a 4 byte/5 cycle instruction. (I don't actually see any indirection here, though; the address is being used as given, not loaded from another address.) I proposed replacing that with
LDA $12 with the direct page register set to $C000 (2 bytes/3 cycles).
I'm not seeing any issue with the target location of the data transfer, so long as you're willing to limit single transfers to 64K or less: just load a 16-bit index register with $10000 - length, set the operand of your STA instruction to destaddr - ($10000 - length), and loop until the index wraps around to 0. Yes, this requires self-modifying code, but it's pretty innocuous as far as self-modifying code goes. Also, it means you need not worry about the DBR if you don't want to; STA seems to take the same number of cycles for absolute indexed X and absolute long indexed X, according to the WDC book.
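To check that index arithmetic, here is a rough Python model of the count-up-to-zero loop (a sketch, not 65816 code). The names `plan_transfer` and `simulate` are made up for illustration, and the model assumes a 16-bit X register and that the long indexed effective address wraps at 24 bits.

```python
def plan_transfer(dest, length):
    """Return (x_start, sta_operand) for a loop that counts X up to zero.

    Hypothetical helper: dest is a 24-bit address, length is 1..$10000 bytes.
    """
    if not 1 <= length <= 0x10000:
        raise ValueError("single transfers are limited to 64K")
    x_start = (0x10000 - length) & 0xFFFF      # initial 16-bit X
    operand = (dest - x_start) & 0xFFFFFF      # value patched into STA long,X
    return x_start, operand


def simulate(dest, length):
    """List the addresses the modelled STA long,X would write, in order."""
    x, operand = plan_transfer(dest, length)
    written = []
    while True:
        written.append((operand + x) & 0xFFFFFF)  # STA operand,X (assumed 24-bit wrap)
        x = (x + 1) & 0xFFFF                      # INX with 16-bit index
        if x == 0:                                # BNE falls through when X wraps to 0
            break
    return written


if __name__ == "__main__":
    x_start, operand = plan_transfer(0x021000, 0x200)
    print(f"X = ${x_start:04X}, STA operand = ${operand:06X}")
    addrs = simulate(0x021000, 0x200)
    print(f"first ${addrs[0]:06X}, last ${addrs[-1]:06X}, count {len(addrs)}")
```

For a 512-byte transfer to $021000 this gives X = $FE00 and an STA operand of $011200, so the first store lands on $021000 and the last on $0211FF.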
That said, when I look at the actual old and new code itself:
Code:
; old: absolute long load, indirect indexed store
.loop   LDA $00C012     ; 5 cycles
        STA (zp),Y      ; 6 cycles
        INY             ; 2 cycles
        BNE .loop       ; 3 cycles
                        ; total: 16 cycles

; new: direct page load, absolute long indexed store
.loop   LDA $12         ; 3 cycles
        STA $xxxxxx,X   ; 5 cycles
        INX             ; 2 cycles
        BNE .loop       ; 3 cycles
                        ; total: 13 cycles: about 20% speedup
It's only about a 20% speedup on the bulk data transfers themselves, which may or may not be worth it, depending on how much other overhead you've got, whether you're doing multi-block transfers, and so on.
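For anyone who wants the "about 20%" figure spelled out, here is the arithmetic as a quick Python check. It uses only the per-instruction cycle counts quoted in the listing above; the dictionary and variable names are just labels for this sketch.

```python
# Per-iteration cycle counts as quoted in the two loops above.
old_loop = {"LDA $00C012": 5, "STA (zp),Y": 6, "INY": 2, "BNE": 3}
new_loop = {"LDA $12": 3, "STA long,X": 5, "INX": 2, "BNE": 3}

old_total = sum(old_loop.values())
new_total = sum(new_loop.values())

# Fraction of loop time saved per byte transferred.
time_saved = (old_total - new_total) / old_total

# Equivalent increase in bytes moved per unit time.
throughput_gain = old_total / new_total - 1

print(f"old {old_total} cycles, new {new_total} cycles")
print(f"time saved: {time_saved:.1%}, throughput gain: {throughput_gain:.1%}")
```

So the new loop takes roughly 19% less time per byte, or moves roughly 23% more bytes in the same time, depending on which way you count it.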
Quote:
I went through this exercise when I was designing the SCSI and multi-channel UART drivers for my POC V1 units.... Pointing DP at hardware not only proved to be of no value in performance, it resulted in a lot of hoop-jumping in order to get at things such as indices and pointers that were needed by the driver.
If the driver needed a bunch of indices and pointers, yeah, you'd want the zero page pointing at those. I was talking about the above just for the transfer itself. And I suppose I had a bit of 6809 on the brain, where you have a few more registers and addressing modes for helping to handle this kind of thing. (E.g., you could push a list of transfer descriptors containing block numbers and other information on the user stack, easily index into individual descriptors relative to the user stack pointer for getting the information about each transfer, and pop them off as you work through your transfer list.)