I expected one or two negative comments and a general consensus that I devised an inadvisable memory map. I did not expect the weight and speed of negative reactions. Nor did I expect Dr Jefyll to provide the most tentative response.
Regarding software, my approach is to have applications which work within a 16 bit address-space and an executive which may or may not maintain a larger address-space. Differences in operating system ABI and processor instruction dialect can be handled dynamically within affected routines.
Regarding hardware, my approach is to "flatten the logic". In particular, I have found that the following operations may run in parallel:-
- Multiplexed address latching.
- Address segment decode using A13-A15.
- Read/write qualification.
Empirical testing found that new 74HC138 and 74HC139 chips operating at room temperature were able to invert a 64MHz crystal. I plan to use a raw crystal to obtain both clock phases or, alternatively, one 74HC157 in one of Dr Jefyll's lesser remembered configurations which I call the Scotty manoeuvre. Specifically, one 74HC157 may implement a 4 bit transparent latch where each input may latch independently on high or low input. Where the address-space is 20 bit or less, this may replace one 74x373 or 74x573 and inverter, as suggested in the W65C816 datasheet. Assuming 74x157 operates at similar speed to the decoder chips, assuming 10ns setup time, assuming 5ns RAM, assuming tight physical arrangement, assuming low bus capacitance and assuming steady power, it should be possible to exceed 20MHz at 5V while only using 74HC series. A practical 30MHz design is possible with 74AC series, asymmetric clock and/or over-volting.
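For anyone who has not seen the multiplexer-as-latch trick, here is a rough behavioural sketch in Python. It is only my reading of the idea, not a description of any particular schematic: I assume each 74x157 section has its Y output fed back to one of its own inputs, and that the choice of which input receives the feedback sets the latch polarity for that bit. The names `mux2` and `MuxLatch` are made up for illustration, and the model ignores propagation delay and any glitch behaviour of a real part.

```python
def mux2(select: int, a: int, b: int) -> int:
    """One section of a 74x157: Y follows A when select is low, B when high."""
    return b if select else a


class MuxLatch:
    """One-bit transparent latch made from a mux section with Y fed back.

    transparent_when_high=True  -> data on B, feedback on A.
    transparent_when_high=False -> data on A, feedback on B.
    (Assumed wiring, for illustration only.)
    """

    def __init__(self, transparent_when_high: bool):
        self.transparent_when_high = transparent_when_high
        self.y = 0  # current output state

    def step(self, select: int, d: int) -> int:
        if self.transparent_when_high:
            # select high passes data, select low recirculates the output
            self.y = mux2(select, self.y, d)
        else:
            # select low passes data, select high recirculates the output
            self.y = mux2(select, d, self.y)
        return self.y


# Two sections sharing one select line, latching on opposite levels.
hi_latch = MuxLatch(transparent_when_high=True)
lo_latch = MuxLatch(transparent_when_high=False)
```

Because the select pin is common to all four sections, the per-bit polarity in this sketch comes entirely from how the feedback is wired.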
With a modicum of chip stacking, it is possible to place any two of address multiplexing, clock stretching, address decode or read/write qualification under one DIP 65816. This arrangement is unlikely to break the speed record. However, it aids fitting of dozens of DIP chips into 100mm*100mm and is therefore cheap to manufacture.
I have also discovered that half 74x74 may be used with active high 65xx peripheral chip select to implement a crude privilege system which may preclude access to I/O. The overhead of a privilege system may horrify real-time proponents. However, it is possible to connect 6522 chips with different interrupts while only one 6522 implements a privilege system. This could restrict access to a system timer and filing system but permit high speed audio sampling to RAM. Of particular note, the privilege bit does not hinder address decode speed.
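As a rough illustration of why the privilege bit sits outside the decode path, here is a small Python sketch. It assumes the flip-flop output drives the active high select (CS1 on a 6522) while the address decoder drives the active low select (CS2B). How the flip-flop is set or cleared is not modelled, and the function name `via_selected` is hypothetical.

```python
def via_selected(cs1: int, cs2_n: int) -> bool:
    """A 6522 responds only when CS1 is high and CS2B is low."""
    return cs1 == 1 and cs2_n == 0


def privileged_via_selected(privilege_q: int, decode_n: int) -> bool:
    # Assumed wiring: flip-flop Q -> CS1, decoder output -> CS2B.
    return via_selected(privilege_q, decode_n)


def open_via_selected(decode_n: int) -> bool:
    # Assumed wiring: CS1 tied high, so only the decoder gates access.
    return via_selected(1, decode_n)
```

In this arrangement the two conditions are combined inside the peripheral chip, so no extra gate is inserted after the decoder.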
Conceptually, I find separate consideration of memory map and clock stretching map to be extremely helpful. Often, they use the same circuitry - but not always. This is where the distinction is useful. As an example, it may be desirable to have A13-A15 decoded with 74HC138 to implement 6*8KB RAM, 8KB I/O and 8KB ROM. Meanwhile, 16KB of clock stretching may be decoded with a separate chip. Ignoring the rather elastic timing of each memory region, I have found that it is possible to approximate the memory map of Commander X16, Commodore VIC20, Apple II, later 8 bit Acorn hardware and W65C265 using one 74HC138. This arrangement gained a reputation for being slow and cumbersome when NMOS 6502 PHI2 output (approximately 50ns latency) was connected to 74LS138 input enable (approximately 25ns latency). However, when using PHI0 input, 74HC or 74AC and separate read/write qualification, latency may be less than 15ns. In combination with 600nm CMOS processors and larger, faster SRAM, it is quite feasible to run legacy binaries at more than 10 times their original speed. I would not have considered this arrangement without the popularity of the Commander X16 project.
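To make the split between memory map and clock stretching map concrete, here is a minimal Python sketch. The placement is my assumption rather than something fixed above: RAM in the lowest six 8KB segments, I/O at $C000-$DFFF, ROM at $E000-$FFFF, and the 16KB stretched region taken to be the top quarter of the map ($C000-$FFFF). The names `REGIONS`, `segment`, `region` and `stretched` are illustrative only.

```python
# One entry per 74HC138 output, indexed by A15..A13 (assumed ordering).
REGIONS = ["RAM"] * 6 + ["IO", "ROM"]


def segment(addr: int) -> int:
    """3-to-8 decode: which 8KB segment a 16 bit address falls in."""
    return (addr >> 13) & 0x7


def region(addr: int) -> str:
    """Memory map lookup using only A13-A15."""
    return REGIONS[segment(addr)]


def stretched(addr: int) -> bool:
    """Separate clock stretching decode: assumed to be A15 AND A14."""
    return ((addr >> 14) & 0x3) == 0x3


for a in (0x0000, 0xBFFF, 0xC000, 0xE000, 0xFFFC):
    print(f"${a:04X} {region(a):3} stretch={stretched(a)}")
```

The two functions share no logic, which mirrors the idea that the stretching decode may live in a different chip from the memory map decode.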
As a bonus, the unmapped upper address lines of modern RAM and ROM may be connected to a transparent latch such that 6 bit A16-A21 may be connected directly to RAM and 2 bit A22-A23 may be connected directly to ROM. This allows any permutation of 63 banks to be used with one native and three foreign ABIs. This allows, for example, an unmodified Acorn BASIC binary to use an Acorn ABI while EhBASIC concurrently uses a VIC20/Commodore 64/Commander X16 ABI while the base system concurrently implements native 65816 vectors. Essentially, it is possible to solve the Commodore/Acorn/65816 vector conflict with fairly trivial 15ns glue logic. Unfortunately, this arrangement breaks stack introspection on page one using TSX // AND $102,X or similar. Fortunately, the most common case is an executive routine which flips a carry flag. I presume that drogon or Werner Engineering have already handled these cases. Regardless, all ABIs can be implemented using one slow, cheap 28C256 (possibly programmed with XR2801 EEPROM programmer, possibly connected to a privileged 6522).
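Here is a small Python sketch of how I picture the latched bank byte being split. It rests on assumptions not spelled out above: the ROM is a 32KB part seen through an 8KB window, so the two latched bits A22-A23 would drive ROM pins A13-A14 and pick one of four 8KB pages (one per ABI), while the six bits A16-A21 go straight to the upper RAM address pins. The function names are hypothetical.

```python
def ram_address(bank: int, addr: int) -> int:
    """Physical RAM address: latched A16-A21 above the 16 bit address."""
    return ((bank & 0x3F) << 16) | (addr & 0xFFFF)


def rom_page(bank: int) -> int:
    """Latched A22-A23 select one of four ROM pages (assumed one per ABI)."""
    return (bank >> 6) & 0x3


def rom_address(bank: int, addr: int) -> int:
    """Offset inside a 32KB ROM: page on A13-A14, window offset on A0-A12."""
    return (rom_page(bank) << 13) | (addr & 0x1FFF)


# The same vector fetch address lands in a different ROM page per ABI.
for bank in (0x00, 0x40, 0x80, 0xC0):
    print(f"bank ${bank:02X}: ROM offset ${rom_address(bank, 0xFFFC):04X}")
```

Under these assumptions the vectors at the top of the 16 bit map come from a different 8KB slice of the ROM for each value of A22-A23, without the decoder ever looking at the bank bits.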
The design can be easily implemented in CPLD or FPGA. However, I strongly advise not using more than six bits of input for address decode when using Atmel CPLD, not using more than five bits of input when using Xilinx FPGA and not using more than four bits of input when using Lattice iCE40. Anything more is an unnecessary violation of "flatten the logic". In the general case, for contiguous regions of memory, that means not decoding blocks smaller than 4KB, definitely not decoding blocks smaller than 1KB and never processing A16-A23.
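The block sizes follow directly from the number of decoded inputs. A quick Python check, assuming the decoder only sees the topmost lines of a 16 bit address (the helper name `smallest_block` is made up):

```python
def smallest_block(decoded_bits: int, address_bits: int = 16) -> int:
    """Smallest contiguous block, in bytes, resolvable from the top N lines."""
    return 1 << (address_bits - decoded_bits)


for bits in (3, 4, 5, 6):
    print(f"{bits} inputs -> {smallest_block(bits) // 1024}KB blocks")
```

Four inputs (A12-A15) give 4KB blocks and six inputs (A10-A15) give 1KB blocks, which is where the two limits above come from.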
Finally, I have a quick message to one proponent of "flatten the logic" who latches A16 and then incorporates this and more than six other inputs into CPLD equations: Firmware modification will remove more than 5ns latency from address decode. This would allow, for example, POC1.3 to run faster than 20MHz. Unfortunately, it requires dividing the RAM into two or more discontiguous regions. Oh, wait. POC1.3 does that already.