Multi-Core 64 Bit SIMD On AVR
Posted: Tue Aug 11, 2020 3:53 am
I'm working on a 64 bit SIMD extension to 6502. Initially, I was working on a 64 bit RISC SIMD processor architecture without 6502 compatibility. However, various iterations of the design had poor instruction density, poor execution speed, a 350KB virtual machine or unwieldy hardware cost. Furthermore, none of these designs had a compiler, operating system or applications. Reversing my design into 6502 solved a large number of problems while only introducing minor complications; mostly around stack and flags.
I started with the 65CE02 instruction set. 65CE02 can be concisely described as a superset of NMOS 6502 with bit operations and an additional index register called Z. I thought that 65CE02 was a good basis for a RISC extension because A, X, Y and Z logically map to 2^R registers commonly found in a RISC register file. I am willing to be corrected if this is erroneous.
I discarded the instructions which make 65CE02 useful as a microcontroller in, for example, a keyboard. I replaced them with more useful instructions for 64 bit computing. The base 65CE02 instruction set is otherwise unchanged, although it is moderately escaped. Specifically, 65CE02 bit flip instructions xxxx0111 (0x?7) are now SIMD prefix instructions and 65CE02 bit test instructions xxxx1111 (0x?F) now indicate two byte register-to-register RISC instructions. Potentially, the RISC instructions could be an exact replica of ARM Thumb instructions, with the exception that instructions would not necessarily be word aligned - and the first two nybbles would be swapped. At a tentative stage of development, it is significantly easier to implement a four octal digit encoding which allows operations of the form R5=R6*R7. To reduce encodings, commutative operations, such as addition, multiplication, AND and OR, may share opcodes such that an operand may be doubled or squared at the expense of uninteresting logical operations.
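As a rough illustration of the four octal digit idea, here is a minimal C sketch of a decoder. The field order (operation, destination, source 1, source 2) and the names risc_insn and risc_decode are assumptions made for this example; the post only fixes the 0x?F escape and the two byte length. With the low nybble of the first byte used as the escape, the remaining 12 bits divide exactly into four octal digits.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical field layout: the post only says "four octal digits". */
typedef struct {
    uint8_t op;   /* octal digit 3: operation selector (assumed)  */
    uint8_t rd;   /* octal digit 2: destination register (assumed) */
    uint8_t rs1;  /* octal digit 1: first source register (assumed) */
    uint8_t rs2;  /* octal digit 0: second source register (assumed) */
} risc_insn;

/* Returns true and fills *out if byte0 is an xxxx1111 (0x?F) escape. */
static bool risc_decode(uint8_t byte0, uint8_t byte1, risc_insn *out)
{
    if ((byte0 & 0x0F) != 0x0F)
        return false;                       /* not a RISC escape */

    /* 12 payload bits: high nybble of byte0 followed by all of byte1. */
    uint16_t bits = (uint16_t)(((uint16_t)(byte0 >> 4) << 8) | byte1);

    out->op  = (uint8_t)((bits >> 9) & 7);
    out->rd  = (uint8_t)((bits >> 6) & 7);
    out->rs1 = (uint8_t)((bits >> 3) & 7);
    out->rs2 = (uint8_t)(bits & 7);
    return true;
}
```

Under this assumed layout, the byte pair 0x7F 0x77 decodes to octal 3567: operation 3 with R5 as destination and R6, R7 as sources. Which operation number means multiply is left open here.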
Opcodes 0x07, 0x17, 0x27 and 0x37 provide prefixes for a smattering of 8, 16, 32 and 64 bit SIMD operations on A, X, Y and Z. These SIMD prefixes may be combined with RISC instructions. Opcodes 0x87, 0x97, 0xA7 and 0xB7 provide prefixes for the same smattering of 8, 16, 32 and 64 bit SIMD operations on an alternate set of registers which I tentatively name C, I, J and K. These prefixes also permit a second set of RISC SIMD opcodes. For example, operations with saturation arithmetic. (As stipulated by Edson de Castro as part of the Data General Eclipse 32 bit extension, all of this can be achieved without mode bits. For more detail, see The Soul of a New Machine by Tracy Kidder. Perhaps William D. Mensch, Jr.'s 16 bit extension would be less hated and the fabled 32 bit Terbium extension would be more tractable if similar practice was observed.)
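The prefix opcodes listed above follow a regular bit pattern, so a decoder can be very small. The sketch below assumes that the four opcodes in each group map to 8, 16, 32 and 64 bit lanes in the order listed, and that bit 7 selects the alternate register set; simd_prefix and simd_prefix_decode are names invented for this example. The undefined half of the prefixes (bit 6 set) is simply rejected.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t lane_bits;  /* 8, 16, 32 or 64 */
    bool    alt_bank;   /* false: A, X, Y, Z; true: C, I, J, K */
} simd_prefix;

/* Returns true and fills *out if opcode is one of the eight defined
 * prefixes: 0x07, 0x17, 0x27, 0x37, 0x87, 0x97, 0xA7, 0xB7. */
static bool simd_prefix_decode(uint8_t opcode, simd_prefix *out)
{
    if ((opcode & 0x0F) != 0x07)
        return false;               /* not an xxxx0111 opcode */
    if (opcode & 0x40)
        return false;               /* reserved for future expansion */

    /* Assumed mapping: bits 4-5 hold log2(lane width / 8). */
    out->lane_bits = (uint8_t)(8u << ((opcode >> 4) & 3));
    out->alt_bank  = (opcode & 0x80) != 0;
    return true;
}
```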
The astute will notice that I have only defined half of the prefix instructions. The remainder are reserved for future expansion. The most obvious extension is to double the number of registers. However, this obtains diminishing returns, especially when zero page is available. It is possible to define 128 bit, 256 bit, 512 bit and 1024 bit operations. This is not completely outlandish. Intel and Xtensa already have 512 bit hardware, a GPU with 1024 bit bus is available and ARMv8 specifies 2048 bit SIMD. Alternatively, unused prefixes can be used to provide more opcodes. Most obviously, 32 bit and 64 bit floating point operations within the main registers. Other candidates for floating point precision include 16 bit, 40 bit, 48 bit, 80 bit, 96 bit and 128 bit.
Processor architecture address-space varies considerably and this is complicated by architecture extension, vendor page-banking schemes and considerations for privilege protection and multithreading. A processor architecture typically defines a subset of:-
- Program segment with or without shared libraries.
- Working segment, such as 6502's zero page.
- Data segment with or without shared memory and/or memory mapped files.
- Constants segment.
- One or more stack segments.
- I/O segment.
- Internal bus for the purpose of poking system innards.
From my experiments, I have found that it is possible to split and grow program, data and stack segments using unrelated techniques. Stack segment may exceed 16MB with no ill effect. Data segment requires architecture extension. Program segment growth can be mostly solved by using longer branches and jumps. The awkward case is a data segment reference to program segment. A fairly general case is C++ vtable. As noted by Kernighan and Pike, a related problem occurs with device drivers. In this case, it is possible to stretch a 16 bit pointer by 3 bits or more and implement a prescaled indirect jump to an 8 byte, 16 byte, 64 byte or 256 byte boundary. This alignment is only required on targets referenced from data segment. Regardless, with very little modification, it is possible to run a 6502 binary which may be 512KB - or considerably larger - without the use of overlays or page-banking techniques.
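The prescaled indirect jump amounts to one shift. A minimal C sketch follows, assuming a 3 bit stretch, which gives 8 byte alignment and 64K * 8 = 512KB of reach; JUMP_PRESCALE and the function names are hypothetical, not part of any existing tool.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed build-time choice: 3 gives 8 byte alignment and 512KB reach. */
#define JUMP_PRESCALE 3u

/* Expand a 16 bit pointer held in the data segment (for example a
 * vtable slot) into a program segment address. */
static uint32_t prescaled_target(uint16_t ptr)
{
    return (uint32_t)ptr << JUMP_PRESCALE;
}

/* A program address can be stored as a 16 bit pointer only if it is
 * suitably aligned and within reach. */
static bool can_be_target(uint32_t addr)
{
    uint32_t mask = (UINT32_C(1) << JUMP_PRESCALE) - 1;
    return (addr & mask) == 0 && (addr >> JUMP_PRESCALE) <= 0xFFFFu;
}

/* Inverse of prescaled_target(); caller checks can_be_target() first. */
static uint16_t make_pointer(uint32_t addr)
{
    return (uint16_t)(addr >> JUMP_PRESCALE);
}
```

Only functions whose addresses are stored in the data segment need the alignment; ordinary branches and jumps within the program segment are unaffected.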
I'm trying to keep pointer bit shifts to a minimum because I intend to implement this on an AVR microcontroller. I believe that I am using the same model as Klaus Dormann. Klaus Dormann is quite a luminary among 6502 enthusiasts; more so than most of the 6502's original design team. Klaus Dormann is best known for an extensive 6502 test suite and a 6502 emulator written in AVR assembly. The emulator is available on standard terms for non-commercial use. On a 20MHz ATMEGA328P, emulation runs at approximately 1MHz. There is a minor hobby of overclocking ATMEGA devices to 24MHz. This would provide approximately 1.2MHz emulation, although this is strictly outside of specification. Unfortunately, this is an optimistic case. An ATMEGA328P is the main component of an Arduino Nano and its $5 clones. These boards are typically underclocked to 16MHz to allow use with an unregulated lithium battery. (In this configuration, it is strongly advisable to avoid execution of the HCF opcode.)
After several months of consideration, I remain somewhat awed by the mind-bending insanity of implementing one processor architecture directly on another. For much of the process, it requires holding two disparate register sets in mind while implementing an almost bit-perfect mapping between them. AVR is almost a strict superset of PIC and Z80. Therefore, the task is fairly similar to writing a 6502 emulator in Z80 assembly. Furthermore, Klaus Dormann achieved this with almost clock-cycle accuracy. That's impressive.
I'm attempting the lazier task of writing an implementation in C and letting a compiler do the majority of the work. Unfortunately, the output of avr-gcc is a constant stream of WTF. Compilers for other microcontrollers are worse. Gotchas include the inability to allocate 32 bit or 64 bit globals correctly. Thankfully, a processor register context can be allocated within main() - assuming that the compiler didn't optimize away main(). Hmm. Perhaps sidestepping avr-gcc is the most sensible option.
If you don't care about SIMD on legacy instructions, don't care about commercial use and you're willing to write AVR assembly, more than 90% of an ATMEGA328P's flash memory is available to extend the 6502 processor architecture without affecting the performance of core instructions. If you don't care about performance, do care about your sanity and you are favorable to the GNU Lesser General Public License, it is possible to implement some or all of the architecture using C or C++ while using any desirable permutation of Arduino development environment, library, bootloader and run-time. If you find the Arduino IDE too patronizing, a Makefile is widely available to compile a binary, strip a binary, deploy a binary and verify a binary. This can be amalgamated with a Makefile for testing and archiving. From here, you're only a few steps away from abandoning Arduino libraries and Arduino boards. You'll be using a USB-to-SPI programmer on an Arduino clone board which is supplied blank without the Arduino bootloader. Then you'll be soldering an oscillator to a DIP-scale AVR and improvising a 6 pin ICSP connector from scrap wire. You may develop a one AVR per week habit. Maybe more. But you're in control. You're not an addict. You can quit at any time. Eventually, you'll join a support group, such as forum.6502.org. ("Hello. My name is ________ and I am a 6502 on AVR emulation addict.")
Anyhow, Arduino is a significantly more inclusive proposition and especially so given the wide distribution of Arduino hardware and support. A diligent and not particularly smart 14 year old could probably add an opcode to a working implementation. Speaking as a former diligent and not particularly smart 14 year old, it would have been an *amazing* opportunity to have a fully extensible 6502, written in C, running at a reasonable speed.
#include <rant/kids-today>
#include <rant/both-ways/snow>
#include <rant/both-ways/uphill>
#include <rant/lawn/get_off>
Getting back to 16 bit data segments, it is possible to connect a $2 microcontroller to a bank of memory and allocate 64KB data segments on 64KB boundaries. This requires no bound check and no exception processing. This saves considerable overhead when implementing a 64 bit processor architecture on an 8 bit microcontroller while allowing 6502 programs to each run in a separate memory space. A read-only program segment may be of unrestricted size. Bound check is only required for a branch or jump which crosses a page boundary. Likewise, stack segment only requires bound check when crossing a page boundary. Meanwhile, data segment and zero page never require bound check. These memory interface routines may be selectively compiled with an interpreter for 6502, 65C02, 65CE02, 65816 and the extensions of your choice. In my case, that's a 16 bit, 32 bit and/or 64 bit RISC extension. Unrolled loops for 64 bit SIMD barely fit into the 32KB flash memory of an ATMEGA328P. If a 32 bit extension is sufficient, there is probably enough space to include a 6809 or Z80 interpreter. Or floating point instructions. Or perhaps some cryptography instructions.
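To show why 64KB segments on 64KB boundaries need no bound check, here is a small C sketch. The 8 bit segment number and the names data_address and data_indexed are assumptions for illustration. Because the 16 bit offset is computed in a 16 bit type, it wraps within the segment by itself, and the segment number is only ever placed above bit 15.

```c
#include <stdint.h>

/* Physical address of a byte in a program's 64KB data segment.
 * The segment number is assumed to be 8 bits wide here. */
static uint32_t data_address(uint8_t segment, uint16_t addr16)
{
    /* Segment occupies bits 16 and up; offset can never spill into it. */
    return ((uint32_t)segment << 16) | addr16;
}

/* Indexed access, as in LDA base,X. The 16 bit sum wraps inside the
 * segment, so no comparison or exception path is required. */
static uint32_t data_indexed(uint8_t segment, uint16_t base, uint8_t index)
{
    uint16_t offset = (uint16_t)(base + index);
    return data_address(segment, offset);
}
```

For example, base 0xFFFF plus index 1 in segment 2 resolves to 0x20000, the first byte of the same segment, rather than leaking into segment 3.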
While I'm vaguely competent at writing C and soldering oscillators, I doubt that I'd ever get an AVR running in a genuine 6502 or 6510 socket. The first hurdle is a lack of suitable test equipment. Regardless, instruction set extensions overlap with a project of more general interest. Specifically, for applications which are not performance sensitive, an AVR to 6502 adapter board would be cheaper than a 65816 to 6502 adapter board. It also has the intriguing possibility that extra functionality can be added. This includes wider registers, hardware multiply and native support for SWEET16.
Unfortunately, it also comes with a major disadvantage. Fully synchronous bus operation requires a three clock cycle busy loop. In a legacy system with a 2MHz synchronous bus and 6845 video adapter (or similar), a binary which doesn't look too closely at flags held on stack will run at a maximum of 40% efficiency. This is likely to break any tape or disk storage which uses timing loops. It also borks some games. However, it is sufficient for most productivity software, music software, almost anything written in BASIC and much more besides. Some efficiency gains can be made by caching zero page and/or stack. This reduces the number of bus synchronization cycles. However, eliminating writes to zero page breaks one of Commodore's incompatible page banking schemes. Likewise, eliminating writes to stack may break stack introspection via the X register. Fortunately, this is a rare case. In general, the maximum aggressiveness of caching is determined by host and application. At the expense of minor compatibility, it is possible to exceed Klaus Dormann's current results when a 20MHz AVR microcontroller is socketed in a legacy host.
In particular, significant gains can be achieved by placing a breakpoint on a legacy host's integer and/or floating point multiply routines. The breakpoint address only requires testing during JSR (and RTS) and may be implemented such that no additional legacy bus cycles are required. A positive match invokes hardware acceleration. This speeds multiplication and trigonometric functions by at least a factor of 100. Overall performance will be far less pronounced but it is likely that numerically intensive programs may run faster - despite the multiple impediments of emulation and bus synchronization. As suspected in the DTACK Grounded newsletter, issue 2, Microsoft's 6502, 40 bit floating point multiply routine does not vary across host. As noted in issue 2 and 3, patching one or more float routines is relatively trivial. Unfortunately, other formats, such as Acorn's 48 bit floating point format, may also be of interest. Fortunately, common cases can be handled with one firmware implementation. Firmware in the AVR microcontroller may seek an optional I2C boot ROM. If found then 6502 code may be copied to RAM and executed. From here, it is possible to identify the legacy host (Acorn, Apple, Atari, Commodore, other), set suitable caching and breakpoint parameters and then jump to the 6502's usual reset vector. This arrangement avoids placing an exception case inside the fetch-execute cycle which would otherwise be required to make the serial boot ROM memory mapped and executable.
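The JSR breakpoint can be sketched in C as a table lookup performed when the interpreter decodes JSR. Everything specific below is invented for the example: the breakpoint table, the address passed in by the caller, and a toy calling convention with 8 bit operands in zero page $00 and $01 and a 16 bit result in $02 and $03. A real patch for a host's floating point routine would have to follow that host's own accumulator layout.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*native_fn)(uint8_t *zero_page);

typedef struct {
    uint16_t  entry;    /* 6502 address of the routine to intercept */
    native_fn handler;  /* native replacement run on the AVR */
} breakpoint;

/* Hypothetical replacement: 8 x 8 -> 16 bit multiply using the toy
 * zero page convention described in the lead-in. */
static void native_mul8(uint8_t *zp)
{
    uint16_t product = (uint16_t)((uint16_t)zp[0] * zp[1]);
    zp[2] = (uint8_t)(product & 0xFF);  /* low byte  */
    zp[3] = (uint8_t)(product >> 8);    /* high byte */
}

/* Called when the interpreter decodes JSR. Returns true if the call
 * was handled natively, in which case the interpreter continues at
 * the instruction after the JSR without touching the legacy bus. */
static bool jsr_intercept(const breakpoint *table, size_t count,
                          uint16_t target, uint8_t *zero_page)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i].entry == target) {
            table[i].handler(zero_page);
            return true;
        }
    }
    return false;   /* ordinary JSR: push return address and jump */
}
```

The comparison happens only on JSR, so other instructions pay nothing for it. With one or two entries, the linear scan is cheaper than anything cleverer on an 8 bit core.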
It is possible to extend this system much further. In particular, it is possible to bootstrap away from a legacy host. This is easiest with a card bus system with an address space which is larger than 64KB. Starting with a suitably large amount of RAM (less than 16MB), AVR microcontroller firmware, a hacked up version of CC65 and productivity tools on the legacy host (text editor, BASIC), it is possible to implement a 6502 assembler which understands 32 bit opcode prefixes and uses 32 bit memory. From here, it is possible to self-host a C compiler and linker which runs from the legacy host's prompt. From here, it is possible to statically link a Contiki environment suitable for development. This includes web browser, file browser, text editor, desktop utilities, shell, scripting language, build system and compiler. So far, we have all of the flakiness of shared memory, co-operative multi-tasking and it is likely to be in 40*25 PETSCII (or similar). This is about midway between a Commodore 64 and Amiga Workbench 1.3. However, it is self-hosting and more than sufficient to bootstrap to IDE, VGA, 100Mb/s Ethernet and maybe a NetBSD kernel or suchlike. A multi-core system with crossbar memory is also possible. This can be shrunk to credit card size, scaled to a rack or further developed to contemporary standards. In particular, all variants may be implemented with the same version of microcontroller firmware and the same version of boot ROM.
Technically, this can be built out from a stock Arduino Uno or Arduino Nano with nothing more fiddly than 0.1 inch strip board, DIP sockets and edge connectors. However, the 16MHz or 20MHz section would be significantly more reliable if it was smaller, closer together and used shielded, multi-layer PCB. And it would be more likely to do anything if it was designed by a hardware expert. My contribution to hardware design is a state machine which may be entirely discarded. It has an initial state which allows interrupt polling and optional crossbar arbitration. Using three control lines, it is possible to ripple through six states to set a 16 bit address and write one byte. An overlapping six states allow an arbitrary one byte read. By backtracking through the states, it is possible to implement sequential read, sequential write and atomic read-modify-write. This can be implemented with one 74x138 demultiplexer chip and multiple 74x374 tri-state latches which are set when exiting a state. This is for the optional legacy host bus. One additional control line, one additional 74x138 demultiplexer and three or more latches provide a larger address space which does not require synchronization with the legacy bus. Using prefix opcodes, it is possible to access this bus as a linear address space.
It is possible to implement 6502 hardware emulation on a microcontroller which does not require I/O multiplexing. However, the cost and performance improvement is marginal. It is also significantly less applicable to casual users, such as the typical Arduino experimenter. An ATMEGA328P has an 8 pin I/O port, a 7 pin I/O port and a 6 pin I/O port. It is commonplace for pins to operate in one of five separate modes. In particular, 2 pins connect to a crystal oscillator, 4 pins are required for reprogramming and 2 pins may be used with I2C devices, such as boot ROM, thermometer and clock. For my purposes, 3 or 4 pins can be used as multiplexed control lines, 1 or 2 pins are unused while the 8 pin I/O port multiplexes bytes of address and data. This is routed to the legacy host, on board RAM or a card bus, as required. Hardware profile (socket, boot ROM, thermometer, clock, RAM, cores, crossbars, card slots) may differ significantly but all implementations use the same firmware.
I'll finish with a few questions.
Are there any architectural features which should be added, changed or removed? Klaus Dormann's 6502 emulation will run with minor changes to the memory interface. From that, you can tinker endlessly with extra instructions and registers. However, is there any essential hardware which is missing, such as UART, buttons or lights?
Could someone design a board? I can provide a much more detailed specification and I'm willing to pay for design and finished units.
Does anyone require 64 bit long long int support in CC65 or TinyC? For example, for Year 2038 compatibility?