In for a penny, in for a pound!
The K2416 microcode is largely done, but it's so close to a full Native Mode 65816 implementation that it seems a shame to stop short. After kicking it around with Dr Jefyll and ttlworks, it seems worthwhile to consider "software compatibility" with the 65816. Most of the required registers are in the design already by virtue of K2416, so the main issue is the 65816's wider internal datapath.
Taking the TTL ALU to a full 16 bits is not realistic. Instead, the approach must be to add cycles as necessary to deal with 16-bit quantities. The 65816 itself does this in many situations. For instance, instructions that access data memory add a cycle when 16-bit memory is enabled. That behaviour can be easily matched with an 8-bit internal datapath. But there are definitely situations where the wider path is beneficial. Most notably, the 65816 is able to adjust the high byte of addresses without an additional cycle. Absolute Indexed instructions, for example, take the same number of cycles whether or not a bank boundary is crossed. Similarly, a 16-bit value that straddles a bank boundary does not force an extra cycle.
In these and other situations, the TTL CPU would need an extra cycle to do the same job. If we're careful, this extra cycle would need to be added only if the specific circumstances arise (that is, only if a boundary is indeed crossed). The TTL CPU already has mechanisms that check the internal carry during address calculation and add an extra cycle if necessary. Extending this mechanism to a new set of conditions will take some effort, but the approach seems promising.
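To make the carry-check idea concrete, here is a rough Python model of indexing a 24-bit address one byte at a time. It is only a sketch: the names add8 and indexed_address are made up for illustration, and it assumes the low and high byte additions fit in the normal cycle budget so that only the bank fix-up costs an extra cycle. The real microcode sequencing may well differ.

```python
def add8(a, b, carry_in=0):
    """8-bit add, returning (result, carry_out) like an 8-bit ALU would."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def indexed_address(base24, index16):
    """Add a 16-bit index to a 24-bit base using 8-bit adds only.

    Returns (effective_address, extra_cycles). The extra cycle is charged
    only when the carry out of bit 15 shows a bank boundary was crossed
    (an assumption of this sketch, not a confirmed microcode detail).
    """
    lo, carry = add8(base24 & 0xFF, index16 & 0xFF)
    hi, carry = add8((base24 >> 8) & 0xFF, (index16 >> 8) & 0xFF, carry)
    bank = (base24 >> 16) & 0xFF
    extra_cycles = 0
    if carry:
        # Bank byte must be bumped in a separate step on an 8-bit datapath.
        bank = (bank + 1) & 0xFF
        extra_cycles = 1
    return (bank << 16) | (hi << 8) | lo, extra_cycles
```

For example, indexed_address(0x12FFF0, 0x0020) crosses into bank $13 and reports one extra cycle, while indexed_address(0x121000, 0x0020) stays in bank $12 and reports none.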
An initial review suggests that an 8-bit ALU would require extra cycles as listed below (to be clear, these are extra cycles above and beyond the cycle count that the 65816 would use in the same circumstances). Specifically, the TTL CPU would need an extra cycle when:
- The DL register triggers a page-crossing,
- Stack-relative addressing triggers a page-crossing,
- A 16-bit increment or decrement instruction overflows or underflows the low-byte (e.g. INY),
- A 16-bit value straddles a bank boundary (when the M flag = 0),
- A 16-bit register is transferred (e.g. TXA when the M flag = 0),
- 16-bit indexing is used with Direct Indexed Addressing, or with Absolute, Long or Indirect Indexed Addressing and a bank boundary is crossed,
- 8-bit Direct Indexed Addressing crosses a page boundary,
- A Decimal Mode operation is used (which is unrelated to the 16-bit datapath but still an exception).
As an example of how this might play out, consider an ORA (dir),Y instruction where the DL register is set to a non-zero value and both the M and X flags are set to 0 (i.e. Direct Page is offset and we have 16-bit memory and registers). Here we have three possible additional cycles. The first is required if the sum of DL and the operand crosses a page boundary, a second if indexing with Y crosses a bank boundary, and a third if the 16-bit data value being accessed straddles a bank boundary. So, whereas the 65816 takes 8 cycles to complete this instruction, the TTL CPU would take 11 cycles if all boundary conditions were triggered (and 8 if none were).
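The arithmetic in this example can be written out as a small Python sketch. The 65816 count uses the (dir),Y formula from the table in the P.S. (7-m+w-x+x*p), and the three TTL-only penalties are passed in as booleans. The function name and its parameters are just illustrative; nothing here models the actual hardware.

```python
def ora_dir_ind_y_cycles(m, x, w, p, dx=False, ix=False, mx=False):
    """Return (65816 cycles, TTL CPU cycles) for ORA (dir),Y in Native Mode.

    m, x : M and X flag values (0 or 1)
    w    : 1 if DL != 0
    p    : 1 if indexing crosses a page boundary (only matters when x = 1)
    dx   : DL + operand crosses a page boundary
    ix   : indexing with Y crosses a bank boundary
    mx   : the 16-bit data value straddles a bank boundary
    """
    cycles_65816 = 7 - m + w - x + x * p
    cycles_ttl = cycles_65816 + int(dx) + int(ix) + int(mx)
    return cycles_65816, cycles_ttl


# The case described above: DL != 0, M = 0, X = 0.
best = ora_dir_ind_y_cycles(m=0, x=0, w=1, p=0)
worst = ora_dir_ind_y_cycles(m=0, x=0, w=1, p=0, dx=True, ix=True, mx=True)
print(best, worst)  # (8, 8) (8, 11)
```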
The same penalties would apply to various other addressing modes (see table below), but not to stack push and pull operations. For those we can use the 16-bit incrementer to manipulate the stack pointer in a single cycle. The current incrementer circuit would work as is for pull operations, and with a small change as a decrementer for push operations. The CPU would then be cycle-accurate for all instructions that require these stack operations (including BRK, RTI, JSR, RTS, and all register pushes and pulls).
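As a toy illustration of why the stack escapes the penalty, the Python sketch below treats the stack pointer as a single 16-bit counter that moves up or down in one step, so crossing a page boundary costs nothing extra. The class Stack16 and its cycle counter are hypothetical and only count one cycle per byte transferred. They say nothing about the other cycles in a real push or pull instruction.

```python
class Stack16:
    """Toy model of a Native Mode stack with a 16-bit up/down counter."""

    def __init__(self, sp=0x01FF):
        self.sp = sp & 0xFFFF
        self.mem = {}      # sparse bank-0 memory, address -> byte
        self.cycles = 0    # one cycle per byte pushed or pulled

    def push(self, byte):
        # Write at S, then decrement all 16 bits in the same step.
        self.mem[self.sp] = byte & 0xFF
        self.sp = (self.sp - 1) & 0xFFFF
        self.cycles += 1

    def pull(self):
        # Increment all 16 bits, then read at the new S.
        self.sp = (self.sp + 1) & 0xFFFF
        self.cycles += 1
        return self.mem[self.sp]
```

Starting with S = $0200 and pushing two bytes leaves S = $01FE after two cycles, even though the second push crossed a page boundary.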
Another potential exception is, oddly enough, the block move instructions -- MVN and MVP. They each take 7 cycles to iterate through every byte to be moved. The looping action can be implemented by inhibiting FetchOpcode while the block move is in progress, essentially re-executing the same opcode for every byte until the block is done. The X, Y and C registers are used as 16-bit loop variables to maintain the source address, destination address and loop-count respectively; and DBR is used as a temporary register to hold the destination bank. Honestly, the opcode feels more than a little awkward, but you can't argue with the economy of it all.
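The re-execution loop can be sketched in Python as below. This models MVN only, assumes 16-bit index registers, and relies on the 65816 convention that C holds the byte count minus one (so the move ends when C wraps to $FFFF). The function name mvn and the dictionary used as memory are illustrative only.

```python
def mvn(mem, src_bank, dst_bank, x, y, c):
    """Model MVN as one opcode re-executed once per byte.

    mem is a dict mapping 24-bit addresses to bytes. x and y are the
    16-bit source and destination offsets, c is the byte count minus one.
    Returns (x, y, c, dbr, cycles) after the move completes.
    """
    dbr = dst_bank & 0xFF          # DBR holds the destination bank
    cycles = 0
    fetch_inhibited = True         # stands in for holding off FetchOpcode
    while fetch_inhibited:
        src = ((src_bank & 0xFF) << 16) | x
        dst = (dbr << 16) | y
        mem[dst] = mem.get(src, 0)
        x = (x + 1) & 0xFFFF       # MVN counts upward (MVP would count down)
        y = (y + 1) & 0xFFFF
        c = (c - 1) & 0xFFFF
        cycles += 7                # 7 cycles per byte moved
        fetch_inhibited = c != 0xFFFF
    return x, y, c, dbr, cycles
```

Moving three bytes (C = 2) from bank $01 to bank $02 takes three passes and 21 cycles, leaving C = $FFFF.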
Ok, so what extra hardware is required for all this? Here is what I see so far:
- The 16-bit incrementer needs to also handle decrement operations,
- A dedicated incrementer is required for the upper 8 bits of the address bus (the existing incrementer is fixed to the lower 16 bits and cannot be reused without adding significant switching delay),
- The incrementer carry must be tested (in addition to the ALU carry) to determine if an additional cycle is needed,
- The cycle-adding mechanism needs to detect the various conditions listed above (including the ability to inhibit FetchOpcode for MVN/MVP),
- The datapath needs to be extended to support direct transfers from X and Y to the address bus for MVP and MVN.
This is a fairly contained set of changes to hardware. There is a lot of microcode gymnastics involved (and squeezing yet more control signals onto an already full control RAM gives me some pause), but on the whole nothing here looks prohibitive. Interestingly, the erosion in performance seems tolerable, which is to say that the additional cycles strike me as a very reasonable tradeoff. A little bit of care while programming would minimize them, but we're not likely to quibble about a cycle or two at 100MHz!
One interesting variation is that Emulation Mode in this case would be fully NMOS compatible. This would allow the CPU to run fussy C64 software on boot up. The downside of course is that existing software that relies on the 65816's extended Emulation Mode would not work correctly. (I suppose to be truly general one would need to be able to select between 65816, 65C02 and NMOS 6502 compatible Emulation Modes ... hmmm).
Alright, so is it full-steam ahead with the 65816? Well, I'm not sure yet. I'm still working things out and a show-stopper may yet materialize (or perhaps a collection of "show-dampers" that amount to a show-stopper). I thought I would share the progress to date nevertheless. Let me know if I'm missing something obvious (which is likely given that I've never programmed a 65816). As always, any and all input is very much welcome! (Thanks Dr. Jefyll for working through a couple of issues with me).
Cheers for now,
Drass
P.S. By way of illustration, the table below compares 65816 cycle counts with those of the TTL CPU in Native Mode. The opcode in question is an ORA, although the same cycle counts apply to several other opcodes.
Code:
MODE        65816          TTL CPU
---------   ------------   -------------
(dir,X)     7-m+w          +x0, +dx, +mx
stk,S       5-m            +sx
dir         4-m+w
[dir]       7-m+w          +mx
imm         3-m
abs         5-m            +mx
long        6-m            +mx
(dir),Y     7-m+w-x+x*p    +ix, +dx, +mx
(dir)       6-m+w          +mx
(stk,S),Y   8-m            +ix, +mx
dir,X       5-m+w          +x0, +mx
[dir],Y     7-m+w          +ix, +dx, +mx
abs,Y       6-m-x+x*p      +ix, +mx
abs,X       6-m-x+x*p      +ix, +mx
long,X      6-m            +ix, +mx
Legend:
--------------------------------------
m  = 1 if the M flag is on
x  = 1 if the X flag is on
w  = 1 if DL != 0
p  = 1 if a page boundary is crossed
x0 = 1 if the X flag = 0
dx = 1 if DL + Operand crosses a page boundary
mx = 1 if the 16-bit data value straddles a bank boundary
sx = 1 if stack-relative addressing crosses a page boundary
ix = 1 if indexing crosses a bank boundary