Thanks for the answers!
SPI has 4 different modes; I would implement only "mode 0", since it seems to be the common "language" among almost all of the SPI-capable devices I would need. In any case, every SPI mode uses one clock edge (rising or falling) to "read" the signal (i.e. the master reads, or "captures", MISO) and the opposite edge to "write" the signal (i.e. the master writes, or "propagates", the MOSI level). I am not sure a single '299 can do this, as it seems to have a single "CP" pin for the clock and the truth table only mentions the low-to-high transition (= rising edge) for operations. It would be nice if it could, since I am not so familiar with the more complex 74xx parts beyond the simpler gates, the usual '138 and similar "usual" stuff for a more-or-less beginner.
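To make the edge roles concrete, here is a minimal Python sketch of one mode 0 byte exchange (CPOL=0, CPHA=0, MSB first). It is only a behavioural model, not a description of any particular chip; the function name and the idea of modelling the slave as a second shift register are my own assumptions for illustration.

```python
def spi_mode0_transfer(tx_byte, slave_byte):
    """Exchange one byte in SPI mode 0, MSB first.

    Returns (byte received by the master, byte received by the slave).
    """
    master_sr = tx_byte & 0xFF      # master's outgoing shift register
    slave_sr = slave_byte & 0xFF    # slave's outgoing shift register
    master_rx = 0
    slave_rx = 0
    for _ in range(8):
        # Clock is low: each side presents its current MSB on its data line.
        mosi = (master_sr >> 7) & 1
        miso = (slave_sr >> 7) & 1
        # Rising edge: both sides sample ("capture") the line they listen to.
        master_rx = ((master_rx << 1) | miso) & 0xFF
        slave_rx = ((slave_rx << 1) | mosi) & 0xFF
        # Falling edge: both sides move on ("propagate") to the next bit.
        master_sr = (master_sr << 1) & 0xFF
        slave_sr = (slave_sr << 1) & 0xFF
    return master_rx, slave_rx
```

The point of the sketch is that capture and propagate happen on opposite edges, which is exactly what a register with a single rising-edge clock input does not give you for free.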
The other problem: as far as I understand, the '299 has a single register. I can store a value which is shifted out while the "received" data is shifted in, replacing (?) the original value in the shift register bit by bit. However, for some tasks, like SD card interfacing, I would need to read a block, which requires outputting $FF on MOSI while reading data on MISO from the card (well, SPI is a full-duplex synchronous serial bus after all). Surely that's not a problem, as I can "refill" the shift register with $FF after each byte is shifted (actually this just means keeping MOSI at 1 all the time, so technically it can be solved in a different way too, without providing a "true" full-duplex interface). But it's a performance bottleneck, and I would lose the ability to use the block move opcode of the 65816, which is kinda nice. The block move would also allow putting the I/O in a non-zero bank (which is nice: simpler address decoding!) without the performance penalty that long addressing would otherwise mean, because a block move is the same speed from any bank, if I am correct.
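The "received data replaces the outgoing value" effect can be sketched with a simplified single shift register in Python. This is an assumed, idealised model (one clock call per bit, left shift, MSB out first); it ignores the real part's mode pins and edge timing, and the class and function names are made up for the example.

```python
class SingleShiftRegister:
    """Idealised 8-bit shift register shared by transmit and receive."""

    def __init__(self):
        self.value = 0

    def load(self, byte):
        # Parallel load from the CPU data bus.
        self.value = byte & 0xFF

    def clock(self, serial_in):
        # Shift one bit out of the MSB and one bit into the LSB.
        serial_out = (self.value >> 7) & 1
        self.value = ((self.value << 1) | (serial_in & 1)) & 0xFF
        return serial_out


def exchange(reg, miso_bits):
    """Clock in the given MISO bits; return the MOSI bits that went out."""
    return [reg.clock(bit) for bit in miso_bits]
```

After loading $FF and clocking in eight bits, the register holds the received byte. If it is not reloaded, the next transfer sends that received byte back out on MOSI instead of $FF, which is why a reload per byte (or a MOSI line held high) is needed.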
What I thought of for the I/O possibilities for the CPU is something like this (remember, SPI does not really have separate read and write modes, both always happen at the same time, so I call an 8-bit transaction a "transfer"):
CPU read (read on IO-addr-0): shift-in (MISO) result of the previous transfer (but do NOT start a new transfer for now!)
CPU read (read on IO-addr-1): shift-in (MISO) result of the previous transfer AND start a new transfer with the last written shift-out value by the CPU
CPU write (write on IO-addr-0): shift-out register which will be used on the next transfer (but do NOT start a new transfer for now!)
CPU write (write on IO-addr-1): shift-out register AND start a new transfer
So basically the CPU R/W line would select the shift-in or the shift-out register, and a single bit of difference in the I/O address decides whether the access only reads/writes the register or also starts a new transfer. SPI would be clocked fast enough that a transfer always finishes before the CPU could possibly access the registers again, so there is no need for a busy flag, etc. This may seem odd (the SD card specification, for example, writes about a maximum clock of 400 kHz, or something like that, before the card is identified), but e.g. the SD card cartridge for the Enterprise-128 (Z80 based though, but never mind) has its SPI interface implemented in a CPLD clocked at a constant 16 MHz, so the Z80 never has to worry about an unfinished 8-bit transfer, and still there is no problem with most SD cards. Well, here and in my whole post, the SD card is only one (but important) example of what to use SPI for; of course there can be others (like the ENC28J60 for Ethernet). This interface would need (I guess) at least two shift registers (for "write" and "read"), so maybe one '299 is not enough, but it does not seem overcomplicated yet.
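The four access types listed above can be modelled in a few lines of Python. This is only a behavioural sketch under my own assumptions: the class name `SpiPort` is hypothetical, the attached device is modelled as a callable that takes the byte sent on MOSI and returns the byte it answers on MISO, and a transfer is treated as instantaneous (matching the "no busy flag" idea).

```python
class SpiPort:
    """Behavioural model of the proposed two-register SPI port."""

    def __init__(self, device):
        self.device = device    # callable: MOSI byte -> MISO byte
        self.shift_out = 0xFF   # value sent on the next transfer
        self.shift_in = 0xFF    # result of the previous transfer

    def _transfer(self):
        # One full-duplex 8-bit transfer, assumed to finish before
        # the CPU can touch the registers again.
        self.shift_in = self.device(self.shift_out) & 0xFF

    def read(self, addr):
        # Always return the previous result; addr bit 0 starts a new transfer.
        result = self.shift_in
        if addr & 1:
            self._transfer()
        return result

    def write(self, addr, value):
        # Always latch the outgoing value; addr bit 0 starts a new transfer.
        self.shift_out = value & 0xFF
        if addr & 1:
            self._transfer()
```

Reading a block of N bytes would then be: one write of $FF to IO-addr-1 to start the first transfer, N-1 reads from IO-addr-1, and a final read from IO-addr-0 so that no extra transfer is started.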
The trick would be that the I/O address of the SPI interface does not actually need to be decoded from the LSB of the CPU address. That's because the block move opcode of the 65816 has the "problem" that both the source and the target address are incremented (well, or decremented) after each byte is copied. Thus, if we need to read/write at least (let's say) 512 bytes with the block move op in a clean way (fast enough, short code), then at least 512 consecutive addresses are needed for the I/O, all with the very same purpose. Again, with the I/O outside bank zero of the 65816 it's usually not a problem to "waste" address space, unless you need 16 MByte of RAM and minimal "waste" for I/O ... not my case, 512K SRAM would be enough, practical and also cheap.
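One way to picture this decoding is the Python sketch below. The bank number, the window layout and the choice of address bit 9 as the "start a transfer" select are all hypothetical values picked for illustration; the only idea taken from the text is that the low address bits are ignored, so 512 consecutive addresses all hit the same register.

```python
IO_BANK = 0x01          # hypothetical non-zero bank reserved for I/O
WINDOW_BYTES = 512      # consecutive addresses mapped to one register


def decode(addr24):
    """Map a 24-bit CPU address to a register select.

    Returns 0 (access only), 1 (access and start a transfer),
    or None if the address is outside the SPI window.
    """
    bank = (addr24 >> 16) & 0xFF
    offset = addr24 & 0xFFFF
    if bank != IO_BANK or offset >= 2 * WINDOW_BYTES:
        return None
    # Bits 0..8 are ignored; bit 9 plays the role of the select bit.
    return (offset >> 9) & 1
```

With this layout a block move whose I/O-side address walks through $010200-$0103FF would touch the "start a transfer" register on every byte, while $010000-$0101FF would give the access-only register.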
My post seems to be somewhat eclectic, since I talked about a "not so much performance oriented method" (e.g. a semi-software solution), but somehow I also feel that performance is not a bad thing, of course, if it can be had with a little more complexity. Using the unused opcodes of the 65C02 is a good idea for the other solution, but I would go with the 65816.
I am not sure how dumb my ideas are, so feel free to laugh at me.
I am really not so experienced, but I would like to start building something at last!!