Just so you have some perspective, the parallel SCSI subsystem in my POC V1.1 unit has a raw bus speed of 3.5MB/second and a theoretical throughput to/from core of ~750KB per second when handling 32KB blocks of data. During actual testing, a 32KB block loaded in ~45ms ±4ms, which works out to ~728KB/sec. I used the counter/timer in the DUART, running at 250 Hz, to track performance; the timer's 4ms tick period is where the ±4ms uncertainty comes from. In that test, the 65C816 was running at 12.5 MHz, and disk reads and writes targeted bank $00 without any use of indirection. This is being done with interrupt-driven code and some very tight loops, along with a SCSI ASIC that handles the bus protocol in hardware. If I had to do it entirely in software, I'd be lucky to get 200KB/sec.
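As a quick sanity check on those figures (a sketch only; the 32KB block size, 45ms ±4ms timing, and 250 Hz timer rate are from the test above, with KB taken as 1000 bytes):

```python
# Measured: a 32KB block loaded in ~45ms, timed with a 250 Hz
# counter/timer, so each tick is 4ms -- hence the ±4ms uncertainty.
BLOCK_BYTES = 32 * 1024        # 32KB block
TICK_S = 1 / 250               # 250 Hz DUART timer -> 4ms resolution

def throughput_kb_s(elapsed_s):
    """Throughput in KB/sec (KB = 1000 bytes, matching the text)."""
    return BLOCK_BYTES / elapsed_s / 1000

print(round(throughput_kb_s(0.045)))            # ~728 KB/sec
# One timer tick either way bounds the true figure:
print(round(throughput_kb_s(0.045 + TICK_S)))   # ~669 KB/sec
print(round(throughput_kb_s(0.045 - TICK_S)))   # ~799 KB/sec
```

So the ±4ms timer granularity puts the real throughput somewhere between roughly 670 and 800 KB/sec.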
728KB/sec may sound pretty fast, but it's only about one-fifth the speed of the SCSI bus, and I'm running the bus at its slowest rate (SCSI1-1986 speed, to be precise). There is a certain amount of unavoidable overhead, such as disk seeks, bus protocol, etc. However, the real roadblock is the 65C816 itself. It can only move bytes so fast, and that "so fast" is much slower than what the SCSI hardware is capable of. The performance solution would be a DMA controller operating on a two-clock bus cycle. That's a project for another time.
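The "one-fifth" figure comes straight from the two numbers above (3.5MB/sec raw bus rate vs. the ~728KB/sec measured):

```python
# Rough bus-utilization figure: measured throughput vs. raw bus speed.
bus_kb_s = 3500          # 3.5MB/sec raw SCSI1-1986 bus rate
measured_kb_s = 728      # from the 32KB / 45ms test

utilization = measured_kb_s / bus_kb_s
print(f"{utilization:.0%}")   # ~21%, i.e. about one-fifth
```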
Something to be aware of is that reading data from a fixed address with a 65C816, as is the case with disk I/O, will involve long indirection if the data is going into or coming out of a different bank than the one in which the I/O device is located. Indirection of any kind costs clock cycles because it involves additional internal steps in the MPU. Any 24-bit load or store will incur a one-clock-cycle penalty for each access. If the access involves a read-modify-write operation, e.g. incrementing a byte in another bank, which must be done as a load, modify and store through a long pointer, there will be multiple penalties, because such an operation requires multiple accesses. Most mass storage accesses don't involve R-M-W operations, but they still require indirection and, usually, indexing, which also adds cycles. If bytes are fetched from one bank and written to another, the indirection involves an "extended" pointer (a 24-bit pointer) on direct page to hold at least one of the addresses. So there will be the double-whammy of indirection and 24-bit access.
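The penalties can be seen in the base cycle counts for the various load addressing modes. The numbers below are the base figures from the WDC data sheet (8-bit accumulator); exact totals vary with indexing, direct-page alignment and register width, so treat this as an illustrative sketch:

```python
# Base 65C816 LDA cycle counts by addressing mode (8-bit accumulator).
cycles = {
    "LDA abs":    4,   # absolute, same bank
    "LDA long":   5,   # 24-bit absolute: +1 cycle for the extra address byte
    "LDA (dp),Y": 5,   # indirection through a 16-bit pointer on direct page
    "LDA [dp],Y": 6,   # long indirection through a 24-bit pointer
}
for mode, n in cycles.items():
    print(f"{mode}: {n} cycles")
```

Note the pattern: going from a 16-bit to a 24-bit address costs one extra cycle, and that penalty repeats on every access, which is why it adds up over a multi-kilobyte transfer.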
This is where cycle counting gets into the picture: it's how you can predict the performance of your code. When you start adding up all those twists and turns the code takes during a mass storage access, you'll wonder where the time went.
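Here's what that adding-up looks like in practice. The inner loop below is hypothetical, not my actual driver code: a minimal byte-mover reading a device register in bank $00 and storing through a 24-bit direct-page pointer into another bank, using base cycle counts at the 12.5 MHz clock from the test above:

```python
# Hedged sketch: predict a copy loop's best-case throughput by cycle
# counting. The instruction mix is a hypothetical minimal inner loop;
# cycle figures are base counts from the WDC data sheet.
CPU_HZ = 12_500_000  # 12.5 MHz, as in the test above

loop = {
    "LDA abs":    4,   # read the device's data register (bank $00)
    "STA [dp],Y": 6,   # store via 24-bit pointer into another bank
    "INY":        2,   # bump the index
    "BNE loop":   3,   # branch taken
}
cycles_per_byte = sum(loop.values())                   # 15 cycles/byte
print(round(CPU_HZ / cycles_per_byte / 1000))          # ~833 KB/sec ceiling
```

That ~833KB/sec is a ceiling for the loop alone; interrupt servicing, block bookkeeping and pointer updates at bank boundaries eat into it, which is consistent with the ~728KB/sec actually measured.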