I am horrified that I am repeating a pattern which has ongoing for at least a decade. Specifically, newbie creates account and announces an architecture extension which is never completed or commercialized. This trend was old in 2007. But do we know what happens to these efforts? Is it compiler support? Operating system support? Poor gains? Complexity of the task? Market acceptance? I was not aware of all variants given in references but there may be some omissions. Specifically, the MyCPU Project, Gigatron and 65j02 may be of interest.
Other ProjectsThe MyCPU Project is gorgeous. I wish that I invented it. It is an extensible rack design implemented with logic chips on the smallest 160mm*100mm EuroCards. It is loosely based upon 6502, although not binary compatible. I suspect that flag handling is incompatible. Regardless, there is a fork of TinyC and, therefore, for practical purposes, compiler support is marginally behind 6502. The CPU requires home PCB etching of five cards. Memory, serial, VGA and Ethernet require further cards. A full system is sufficient for GUI web browsing via VNC client. Due to the arrangement of the small cards, there is one 8 bit data path through the CPU. This may explain why some instructions require up to 10 clock cycles rather than a 6502's maximum of seven clock cycles.
The Gigatron is even more loosely inspired by 6502. Unfortunately, the Gigtron is a veritable set of architectural dead-ends including a Harvard architecture with an 8 bit data segment and read-only 16 bit program segment (of which 8 bit is for opcode and 8 bit is for operand); a branch delay slot; and a completely borked I/O implementation, which is lampshaded with an extension unit called Pluggy McPlugface. The least 6502 feature is the X:Y+ addressing mode which indexes an upper:lower memory address with post increment only on Y. I strongly suspect this addressing mode is used to implement VGA output. The Gigatron omits 6502 stack and interrupts. (The method to implement rastering without interrupts around an interpreter with string handling is left as an exercise.) Most impressively, the single board can be soldered from parts in 2.5 hours, it runs at more than 6 MIPS and it is supplied with a driving game which is similar to Namco's Pole Position.
I'm not sure about the details of the 65j02. I am under the impression that Jeri Ellsworth shipped an FPGA implementation which contained a (presumably) clock cycle accurate Commodore 64 emulation. I'm particularly irked about this project because it occurred to me many years ago that a legacy games console could be process shrunk until it could be contained within a joystick and, therefore, the lead from the "joystick" would be something akin to a UHF modulated NTSC signal. It was particularly galling to learn that 70,000 were sold on one day on QVC, 250,000 were manufactured in one production run and more than 500,000 were manufactured in total. I suspect many people would like to sell a similar quantity of 6502s, especially if it is possible to make a dollar or more on each one. Details are sketchy but I've been told by a friend that it is possible to expand one of these "joysticks" into a full Commodore 64 with cassette deck, multiple 1541 floppy drives, IEEE-448 printer and networking. From here, a Raspberry Pi can be used to persist 1541 protocol or bridge to a read-only FTP archive. Potentially, WebDAV allows read/write random access to a remote filesystem with encryption and authentication. This is less of a bodge than NFS, SMB or FAT32 block storage over USB, Ethernet or similar.
Most significantly, I've been told that the C64 Direct-to-TV has dormant 32 bit functionality. This is entirely plausible given that Jeri Ellsworth's previous work includes a multi-system emulator. Can anyone confirm this? Or is it the seed of a rumor of a clock cycle accurate 6502 with 32 bit extensions? Also, don't stop when you're winning! Why are there no sequels to this product? Jeri Ellsworth appears to be working on an augmented reality project. Perhaps the low energy requirement of wearable computing suggests a hacked-up, 32 bit, inspired-by-6502 design with FPU extensions?
My project appears to be almost a strict subset of André Fachat's 65002. I have also seen reference to 65GZ032 but it appears to have been run from a Yahoo mailing list which fell dormant before it was deleted. I was previously unaware that 65GZ032 was intended to run from a legacy host, like a Commodore 64. I am simultaneously encouraged and discouraged that I have vaguely conventional ideas but that my predecessors have not been wildly successful. I was previously unaware of 65Org32. I was only aware of 65Org16. Indeed, I laughed outrageously when I discovered 65Org16. You guys know how to have good, clean fun! The most common cause of processor architecture obsolescence is exhaustion of the address-space. So just make everything twice as wide! Who cares if that kills binary compatibility - or the ability to read bytes? But hey, you move from an architecture which defined less 2/3 of the instruction-space to an architecture which defines about 0.23% of the instruction-space. One fork filled this with two nybble fields allowing 16 position shifts and 16 sets of registers. For example, TXA now has 16 sources and 16 destinations. I'm disappointed that no-one suggested 16 stacks. Regardless, the 16 bit 70MHz FPGA implementation, suitable for VGA display, convinced me that a softcore 6502 extension is viable as a GPU.
RequirementsI'm still reading through references. The references are invaluable for gathering requirements. A subset of contributors would like to see a 6502 (or a extension thereof) decode a contemporary video format at a contemporary resolution. Technically, that requirement is a moving target which gets more difficult every year or so. Nevertheless, it is a sentiment which I share. I also believe that it is possible to implement a Java interpreter which outperforms x86, ARM and RISC-V.
I have noted two pairs of contradictory requirements and I may encounter more. The first pair of contradictory requirements is fast interrupts and atomic block copy. I suspect this is part of a larger problem where the usual two choices are synchronous block copy with all of the fun of cache churn or asynchronous DMA with all the fun of queuing and notification. For anyone who disdains a cache, churn isn't a consideration. The second pair of contradictory requirements is parsimonious, composable opcodes and hardware integer divide. (Float is a separate case.) One of the features which made ARMv1 notable among a long line of forgettable, interchangeable RISC processor architectures was the decision to implement addition, subtraction and multiplication but not division. The rationale was quite simple. Opcodes to perform addition, subtraction and multiplication can be used to perform arithmetic with arbitrary precision. However, this is not the case for hardware divide. Furthermore, division is the least common of the four operations and division by powers of two can be reduced to logical operations. In the miscellaneous case, where the least common operator has a denominator which is not a power of two, a subroutine (typically called _moddiv or similar) performs long division. The integer result and/or the remainder can be used after a single call. So, a=x/7; b=x%7 amortizes the work of one long division.
There are many suggestions for additional opcodes but relatively few suggestions for memory models, privileges or removing features. For example, I'm considering replacement of the absolute addressing mode with absolute+Z. The reason for this becomes apparent where the memory model is 32 bit (or larger). In this case absolute+Z should be regarded as Z+absolute where a 32 bit pointer (or larger) is supplemented with a 16 bit unsigned struct offset. Implementation divides into three cases. When emulating a legacy 6502 dialect with register Z, absolute addressing mode without Z offset must be honored. When emulating my desired dialect, temporarily clearing X, Y or Z allows X+abs, Y+abs or Z+abs to work as abs. (Likewise for alternate registers I, J and K.) When emulating a legacy 6502 dialect without register Z, hardware reset and/or executive code should clear Z. The previous two cases are now interchangeable and the most convenient one can be implemented.
Interupt Latency Versus Atomic Operations Versus Virtual Memory And Multi-CoreI don't plan any changes to the core 151 instructions of NMOS 6502 beyond Z+abs addressing. Other changes, such as longer jumps and branches, two sets of registers and SIMD, are only invoked with prefixes. However, consideration of opcodes in extensions, such as INW and DEW (16 bit atomic increment and decrement), led to a productive chain of thought. In my desired dialect, there are always 256 short memory references which are accumulator width. In 64 bit mode, this leads to a zero "page"/base "page"/whatever which is 2KB rather than the customary 256 bytes. In part, this provides compatibility with JVM opcodes. For example, 0x15 ILOAD, 0x16 LLOAD, 0x17 FLOAD, 0x18 DLOAD and 0x19 ALOAD would all map to LDA with base page addressing mode. (Java syntax looks like C++ without pointers but don't be fooled. A JVM distinctly looks like Forth targeting 6502 and 8086.)
In my desired 6502 dialect, which is derived from 65CE02, opcodes such as INW with base page addressing, are redundant. However, I was alarmed that I would be discarding atomic operations which are not present in the core 6502 instruction set. Indeed, there are no atomic operations in the core instruction set. Instead, there is a standard idiom, common across many 8 bit processor architectures:-
- Clear maskable interrupts.
- Do anything.
- Set maskable interrupts.
Ah, yes. For any code running outside of NMI, opcodes CLI and SEI are bookends for an arbitrary, Turing complete, atomic operation. That works for single-core, single-tasking, single-threaded, single-user code and can be stretched to a few cases outside of that. In the general case, a contemporary system requires a semaphore or mutex which conveys more than one bit of state without incurring the ABA problem (like x86). POSIX compatibility may also be desirable, although Linux pthreads demonstrate that implementation doesn't have to be particularly conformant.
My first idea to provide upward and downward compatibility is probably a dud. If it is possible to read or write a pointer with one instruction then it is more than sufficient to have a maximum of 16 instructions between CLI and SEI. Unfortunately, an alpha geek said this limit was idiotic. Indeed, it interacts badly with virtual memory and would probably require a four bit counter within a privileged flag register. Hey! I created four modal bits at once!
If we truly don't care about interrupt response time then it is possible to take an idea from the Transputer processor architecture and eliminate the modal interrupt bit. The Transputer architecture has a few good ideas although most of them are taken from the Burroughs B6000 architecture. A fundamental concept in a Transputer is that a straight run of code is implicitly atomic. This dovetails with use of an internal stack which may only be three elements deep. For a bottom priority task, preservation of the internal stack is not guaranteed after a control flow instruction. This is when hardware task switching occurs. Therefore, correct operation requires the internal stack to be exhausted or contain junk at the end of each instruction sequence.
There are multiple advantages if we apply this idea to 6502. Firstly, for 6502 emulation, there is a performance advantage because interrupt polling only occurs during control flow execution. Secondly, code density is improved because all instances of CLI and SEI can be discarded. Thirdly, we can discard the corresponding modal bit - or at least when using dialects of 6502 which don't have CLI or SEI. (Don't worry! We've still got the decimal modal bit!)
Fourthly, for 6502 emulation, it aids optimization of stack. [This explanation is quite long and can be skipped.] Take the example of a straight run of code to apply (a+b)&c to 32 bit values. In this example, a+b has to be evaluated in ascending byte order but the logical operation can be applied in either order. To minimize program size, memory cycles and therefore clock cycles, intermediate values would typically be held on stack. Therefore, the least significant three bytes of a+b would typically be pushed on stack as each byte is evaluated. When progressing from the arithmetic operation to the logical operation, the top byte remains in the accumulator. The least significant three bytes are subsequently popped as the logical operation is applied in descending byte order. If no interrupts (and no stack introspection) can be guaranteed during this sequence then intermediate values can be held in a shallow internal stack. Eight bytes would handle typical cases. This eliminates the memory cycles for the push operations and the corresponding pop operations. This approach still works if we have interrupts and stack introspection. However, if we receive an interrupt signal and then make the external stack consistent, interrupt processing is delayed by a variable number of cycles. It is possible to cache to stack pointer boundaries. However, it is more efficient to make the full depth of the internal stack available at the beginning of every interrupt level, at the beginning of every subroutine and at the beginning of every condition and loop. In common with a Transputer, a deferred interrupt dovetails with exhausting an internal stack after a straight run of instructions.
Fifthly, we may not have arbitrary, Turing complete, atomic operations but we retain useful operations such as linked list insertion and fixed length block copy (as an unrolled loop). Sixthly, if we count the failure to follow a conditional branch as a straight run of code, we get atomic operations such as test-and-set, compare-and-swap and variable length block copy (as an unrolled loop). Ignoring the significantly worsened interrupt response time, many of the disadvantages are currently hypothetical. For example, implicit atomic operations interact badly with NUMA, crossbar memory and virtual memory.
Stack Versus RegisterI am a fan of stack designs and single accumulator designs. However, it is undeniable that such designs get mired when they are process shrunk and/or memory grows. Why does this occur? In my experiments with processor design, I've found that there is a critical path running from accumulator to ALU and back to accumulator which is rarely less than 14 logic gates. 6502 meets this 14 gate criteria. 65816 does not meet this 14 gate criteria. Nor does Z80 or x86. RISC designs typically fail to meet the 14 logic gate criteria although this is not due to register width. One of the most obvious failings of a symmetric register file is that 2^R registers adds R logic gates to the critical path. In mitigation, the same principle applies to addressable memory. 1KB has 10 gates in the critical path. 1MB has 20 gates in the critical path. 1GB has 30 gates in the critical path. 1TB has 40 gates in the critical path. Process shrink makes this worse because it increases the relative latency - and a cache can only improve average case latency at the expense of worst case latency.
An absurd hypothetical example would be to place a 6502 directly adjacent to 1GB RAM. If these two units are interfaced with a trivial bank switching scheme then it is possible to take advantage of static column paging and pre-decode all address bits except the least significant 16. In this case, the latency of the ALU and memory are approximately balanced. However, if the two units are interfaced with a Commodore style bank switching scheme (where program and data accesses are automatically directed to separate banks) then all address bits may change with each access. In this case, the processor would have to be slowed to match the latency of the memory. It is possible to provide two sets of decode via dual port RAM. However, due to economies of scale, it is more cost effective to provide separate banks for program and data. This is commonly known as a Harvard architecture and has numerous disadvantages.
There is a qualitative difference when moving from 1KB to 1MB to 1GB to 1TB. One of these differences is that we can be more relaxed about the critical path of logic gates. This manifests as wider registers, more numerous registers and instructions such as multiplication. Obviously, this leads into Amdahl's law of diminishing returns. But don't act like 8 bit computing is the paragon of efficiency. Efficiency per transistor has been declining since Intel released the 4004 processor architecture. ("Four bits is enough to hold a decimal digit and much more besides! But you profligate whippersnappers use a whole eight bits when you only need one or two! If you keep wasting memory like this, I'll have to buy a whole 1024 bits!")
Prefix Instruction Versus Mode BitMore seriously, a consideration of Amdahl's law has softened my stance against the 65816's modal bits. My first impression was that an extra two bits of state when entering and exiting subroutines added a large amount of cognitive load. It hadn't occurred to me that an application may be written in a style where there is a default setting. For example, the accumulator may default to 16 bit mode and exceptions may be bookended with mode flips. Alternatively, an application may default to don't care and may be explicitly set before each use. That style of use is a net positive. In particular, it aids brevity without sacrificing throughput. However, several misgivings remain. In particular, some of the instruction density is gained through ambiguous encoding. People have exploited the differences between x86_32 and x86_64 to evade malware filters. I'm working on 6502_64 so that we can sidestep this type of technical debt.
An unambiguous encoding necessarily has a lower instruction density. While I also have misgivings about prefix instructions, there is an advantage to a stateless encoding where the size of the prefix plus opcode never exceeds the size of the data. In the case where accumulator width matches the width of the external data bus, an implementation may read two or more instructions concurrently. In the most optimistic case, where alignment is ideal, one of these is a prefix and one of these is an opcode. Essentially, 8 bit data should handled with 8 bit opcodes. Larger data should be handled with larger instructions. This is downward compatible to systems with the most narrow data bus. It also provides an advantage when handling small pieces of data. For example, the latency of 8 bit addition is less than the latency of 64 bit multiplication. Therefore, there may be an advantage to fit more 8 bit addition instructions into a hypothetical cache line.
An individual mode bit is great because it allows two contradictory semantics. However, mode bits tend to hang about in gangs. This leads to unbounded complexity. I believe that Edson de Castro's objection to mode bits was the cost of developing, testing and distributing microcode which would be bank switched by one or more mode bit inputs. Invariably, one mode will be used more frequently than another. However, making this stateful is only an advantage if one mode is the most common. For example, when using a 65816, it is an advantage to default to 16 bit operation. However, when handling 8 bit data, 16 bit data, 32 bit data, 64 bit data and possibly larger pieces of data, it is conceivable that no data size is used for the majority of operations. Effectively, there is no good default. Also, starting from 65816, which defines 254 opcodes in detail and the remaining two in outline, how would I specify 32 bit operation? Conceivably, I could define a small subset of the WDM prefix specify 8/16, 32/64, 128/256 or 512/1024 operation and then use the existing mode bits to specify finer detail. Or perhaps I could specify accumulator and/or index size concurrently and use the legacy short encodings to switch down to shorter sizes?
Actually, I've considered multiple definitions for the 65816's WDM prefix; some whimsical and some facetious. The best encoding that I've devised so far is for the WDM prefix to precede two 12 bit ARM Thumb instructions but redefine the Thumb 8/32 bit operations as 32/64 bit operations. This would allow short instructions which operate on the traditional register set and default to a stateful 8/16 bit encoding. It also allows longer instructions to statelessly specify 32/64 bit encoding and operate on an extended set of registers. It also retains 65C02 and 65816 binary compatibility while approaching the instruction density of Thumb.
32/64 Bit 6502 With Binary CompatibilityThere is suggestion that a 32 bit extension of 6502 does not have to be binary compatible. There is also suggestion that it is called ARMv1. This is the problem with 6502 architecture extension. It either has to extend/emulate/ignore the established 16 bit extension or it is yet another 32/64 bit stack/RISC chip - and competition is steep among RISC chips. I reversed an arbitrary design into 6502 and from that exercise, I've found that it is also possible to reverse various extensions of ARM and RISC-V into various extensions of 6502. The trick is to find a mapping of registers and opcodes. This is particularly easy if it is possible to define a hybrid instruction stream. The unusual history of 6502 made a notably sparse instruction set. I am unable to name any other general purpose processor architecture which did not define less than 2/3 of the instruction-space in the first iteration. Likewise, I am unable to name any other general purpose processor architecture which does not have a contiguous, undefined 1/4 of the instruction-space. This makes 6502 architecture extension particularly easy in a manner which does not apply elsewhere. Likewise, the short and long encodings of ARM and RISC-V provide encodings which do not exceed 1/4 of the instruction-space.
By default, ARM has four byte instructions. From ARMv1 to ARMv7, four bits of every instruction are a conditional execution field. These conditions are defined in pairs, such zero/not zero. One pair is always/never. "Never" is used for a 12 bit short encoding called Thumb. With some bit shuffling this can be mapped into 1/16 of the 6502's instruction-space. With the benefit of hindsight, RISC-V uses a two bit field to specify one 30 bit encoding or three 14 bit encodings. Therefore, RISC-V has 12 times as many short encodings compared to ARM Thumb. Regardless, RISC-V's long encoding can be mapped into a contiguous 1/4 of a 6502's instruction-space. Specifically, it fits into Group 3, Group 7, Group B and Group F. Using a hypothetical two modal bits and a modicum of bit shuffling, it is possible to execute an instruction stream which consists of:-
- 6502 instructions only.
- 6502 instructions and ARM short instructions.
- 6502 instructions and RISC-V long instructions.
- ARM short instructions and RISC-V long instructions.
Without mode bits, it is also possible to execute 65816 instructions and pairs of ARM short instructions. I don't recommend any of this an a practical exercise. However, if two or three hardware definitions are spliced together and an assembler is similarly spliced, the unholy amalgamation will execute the output of GCC, LLVM and TinyC. I mention this for multiple reasons. Primarily, it definitively answers a few questions. Yes, it is trivial to make a 32 bit extension of 6502. In particular, it is possible to have 6502 binary compatibility within ARM. Arguably, it has always been possible to maintain 6502 binary compatibility within ARM. No, it would not have been commercially successful.
There are further reasons to mention a 6502/ARM/RISC-V hybrid. If you cannot exceed this standard then don't bother. Like the video decode problem, the acceptable threshold increases every year or so. However, the quest for single threaded performance has led to designs where an increasing cluster of units run autonomously within their own light-radius. This includes multiple integer units, FPU, MMU, SIMD, crypto and GPU. Most ridiculously, this leads to duplicate circuitry and shadow registers to model the behavior of circuits running within the same CPU. I've found that it is possible to unify most of this cluster into one unit. The trick is to treat each unit as a separate architecture and find a mapping of registers and opcodes. In this case, the primary constraint is register width.
Maximum Number Of RegistersI hope that I've given a sound theoretical basis for a 32 bit processor to have multiple symmetric registers. There is also a theoretical basis for an upper bound which never exceeds eight registers. This is derived from the graph coloring problem. Technically, the British spelling should be used because it was first applied to the geographical divisions of England. However, I'm not a pedant.
There are numerous mathematical problems which have been unresolved for centuries. For example, Fermat's Last Theorem and the Canonball Conjecture (both solved by Andrew Wiles). Well, the graph coloring problem was solved in 1976 and published in 1977. That's incredibly poor timing because it occurred during the development of 6502 and Z80. It was also quite controversial because it was a computer assisted proof. The graph coloring proof states that contiguous, two dimensional regions on a plane never require more than four colors for them to be distinct. This is very, very important for programming because we never have to handle more than four intermediate values. How does that work? All unstructured programs can be implemented as block structured programs. In the worst case, unstructured bytecode can be executed with a block structured interpreter. All block structured programs can be drawn as a state machine on a two dimensional plane without any arrows crossing. (Calls between functions can be implemented with a dispatcher in the shape of a binary tree and all types on stack are fungible.) Given Appel, Kenneth; Haken, Wolfgang (1977),
Every Planar Map is Four Colorable. I. Discharging, Illinois Journal of Mathematics, a state machine on a two dimensional plane without crossing arrows is four colorable.
People argue about the merits of stack, single accumulator or more. Here we have a definitive answer. A statement written in infix notation with precedence rules, such as l=x*x+y*y, can always be reduced to stack operations without ambiguity. Furthermore, where the precision of a value exceeds the precision of an accumulator, the same stack can be used to hold intermediate chunks of data. So, one accumulator, one index register and one stack are sufficient to process one statement. Unfortunately, this is only one node of the state machine. An algorithm with multiple statements may require juggling four or more intermediate values. This can always be reduced to a maximum of four intermediate values. However, for brevity and clarity, or due to implementation quirks, it is not necessary to strictly adhere to this limit.
People disagree over this less frequent but essential case. Multiple registers and stack introspection both allow juggling of four or more intermediate values. Unfortunately, the threshold where each technique is most efficient may not be immediately apparent. For example, it may depend upon the efficiency of a given compiler and given algorithm, connectivity of register transfers, the width of the external bus, cycle time of main memory or the bit patterns of the opcodes used to select functions in a given ALU implementation. However, if addressing is 32 bit or larger then it is overwhelming likely that multiple registers are beneficial.
Even if you completely disregard the graph coloring proof and believe that five is the minimum, exceptions would be sufficiently rare to make stack operations preferable to the inefficiency of doubling registers. Unfortunately, there are some practicalities around indexing and stacks. Arbitrary data is sourced from main memory and modifies one or more accumulators. If these operations are conceptually 3-address operations applied to four accumulators then it is beneficial to provide an additional two registers. These registers would be general index registers into a common data segment or stack pointers into separate conceptual segments. However, the result may not be practical due to poor instruction density. Instruction density can be increased by adding deliberate asymmetry. Asymmetry may arise from 8086 source compatibility with 8080. For brevity of specification, correctness and throughput, binary compatibility with 6502 may be preferable. Asymmetry may require stack operations or additional registers to shuffle data. This requires seven or more general registers. Fortunately, register pressure can be reduced by not making stack pointers or program counter part of the general register file.
Stack pointers can be incremented and decremented without transfer to accumulator. Likewise, intermediate addresses can be computed in an accumulator and then transferred unidirectionally to an index register. Overall, a 3-address machine reaches an efficiency threshold when it has four accumulators, two index/temp registers and one or more stacks. A system with less registers may allow them to be used in multiple modes. For example, a 6502 index register is often used as a temporary register or stack pointer. This intentional conflation reduces cognitive burden, allows a parsimony of implementation and improves instruction density. Unfortunately, there are cases, which may occur once per statement or less frequently, in which this is a false economy. This is particularly true when considering transistor count of a processor and memory.
The ability to handle four or more intermediate values explains much of the lunacy in processor design. Notable examples include the 16 element stack window of TMS9900, register windows in SPARC, the conveyor belt concept of the Mill processor architecture and Intel's Itanium which had more than 2KB of state to context switch. It also explains why processor architectures take multiple positions. For example, the general register count in AVR and RISC-V may differ by a factor of two. Most significantly, ARM Thumb restricts access to eight general registers. There are many reasons for specifying too many registers. The most obvious case is a surplus of bits; typically to achieve 8 bit, 16 bit or 32 bit alignment. ("Hey! We've got a spare bit! Let's have two accumulators and two index registers!") This is why general purpose processor architectures always specify 1/2 of the instruction-space and typically specify more than 2/3. A confounding factor was MIL-STD-1750A which specified a processor with 16 registers. RCA1802 has 16 registers. MC68000 has 16±1 registers and a version was developed with IBM System 370 microcode, which also has 16 registers. SWEET16 has 16 registers where special registers, such as program counter and stack, backfill from R15. Many of the RISC designs influenced directly and indirectly by SWEET16 also have 16 registers and backfill from R15. This includes ARM.
If you want a futile exercise, try designing a symmetric, 3-address, 16 register processor architecture which uses 8 bit opcodes. It can be achieved by splitting the opcode into two or three fields and heavily prefixing the instructions. This includes defaulting to MOV and then prefixing any desired ALU operation. Understandably, the instruction density is terrible, in part due to the combinatorial explosion of prefix instructions. So, how have systems with 8 bit opcodes, such as x86, been sold as MIL-STD-1750A? Easy. Ignore the specification or get a waiver. Indeed. My results with processor design advanced considerably after I ignored MIL-STD-1750A.
My Processor ArchitectureMy first serious design was inspired by 6502, 68000, 88000 and some work with a quadtree video codec similar to HEVC. Specifically, the 2:14 split of STA:ALU operations in 6502 opcode Group 6 and Group D (ORA, AND, EOR, ADC, *STA*, LDA, CMP, SBC), 68000 symmetrical 2-address encoding and quadtree motion delta encoding inspired a 16 bit instruction format in which four nybbles may represent opcode or operand. In the preferred implementation, hexadecimal digits 0-B represent operand and hexadecimal digits C-F represent opcode. As an example, 0xF012, 0x0F12, 0x01F2 and 0x012F are distinct instructions which operate on R0, R1 and R2. In addition to aiding debug, this arrangement provides a practical quantity of 0-, 1-, 2- and 3-address instructions. It also provides a four operand, zero opcode encoding. I was quite pleased with this fusion of ideas. Despite being complete vaporware, it served as a basis to investigate numerous concepts including allocation of instructions in proportion to frequency of execution, constants in the instruction stream, illegal instructions, unaligned memory access using two separate instructions, multiple data stacks, stacks aligned with register width, biased stack pops, biased branches, atomic operations, SIMD, fused multiply-accumulate and fused compare-and-branch for the elimination of all user flags. (This was in ignorance of the MIPS technique of evaluating logical predicates.) I found that I could subdivide the fields into two inputs and two outputs (although it is most convenient for the fused compare-and-branch to use two "outputs"). In my enthusiasm, I would often define two or more functions to the same bit pattern. However, I have developed techniques to prevent duplicate allocation. The design was intended to minimize multiplexing and the latency of TTL loads. Indeed, I hoped to make a scalable 16/32/64 bit design using one, two or four identical boards with 3*32 pin connectors. (This is why I have such admiration for the MyCPU project with its five board CPU and 3*32 pin connectors.)
The ultimate failure of this design was poor instruction density. The pure load/store architecture allowed commutative operations to be reduced. This applies to ALU operations and conditional branches. However, I was unable to retain a balanced design and specify more than 6 bits of constant per 16 bit instruction. This is worse than the notoriously poor constants handling of a Transputer. Simultaneously, the mapping of opcodes from 6502 and 68000 was unfavorable. A further complication was the unconventional 4-address instruction which consumed 81/256 (31.6%) of the 16 bit instruction-space. Unfortunately, I could not find an operation which justified such allocation. I considered multiplication with 64 bit input and 128 bit output (with possible modal bits for signed/unsigned operation). I considered multiplication of the form R0=R1*R2+R3. I considered double MOV, line draw and bit rotation. Assuming such problems were solved, my vaporware would require five or more data paths to 64 bit registers. Whereas, Xtensa only requires three data paths to 512 bit registers. It is a terrible state of affairs when a person's vaporware is inferior to a shipping product.
My second serious design attempted to correct previous failings and also outperform ARM. I thought about this problem for too many years. One of the problems with processor design is that choices are relatively unconstrained. No individual choice is bad or fatal but if you say yes to everything then the result is an Intel Itanium. Never go full Itanium. Anyhow, the most significant change to my design was a more conventional arrangement of instruction fields encapsulated within a variable length instruction encoding. An instruction-space should be regarded as a binary fraction rather than a finite block. Eventually, I discovered that it is possible to specify 7 bit, 14 bit, 21 bit, 28 bit or larger instructions such that:-
- A maximum of 1/2 of the instruction-space may be specified for move instructions.
- A maximum of 1/4 of the instruction-space may be specified for jump and branch instructions.
- A maximum of 1/128 of the instruction-space may be specified for privileged instructions.
- One byte instructions may be 8 bit, single or dual accumulator operations without post-increment.
- Two byte instructions may be 8/16 bit, register file operations with post-increment.
- Three byte instructions may be 8/16/32/64 bit, integer/float, 3-address operations with post-increment and fused accumulate.
- Five byte instructions may be 128 bit operations or larger.
- Longer instructions may use Fibonacci lengths and therefore six byte instructions may be disallowed.
- Commutative operations may share opcodes.
- All relevent instructions may have automatic SIMD.
- Control flow and privileged instructions may be one contiguous group of opcodes. This provides more options for scaled-out implementations.
- Opcodes may be otherwise patterned to facilitate software compression, FPGA implementation or parallel simulation on GPU.
- A breakpoint instruction may scale to the size of any other instruction.
This looks competitive with 8 bit instruction sets, 16 bit instruction sets and RISC; in particular ARM, RISC-V and Xtensa. I implemented this is C and ran it on a bottom-of-the-range Raspberry Pi (700MHz Broadcom ARMv6). Results were mixed.
A design loosely inspired by 6502 and ARM running on ARM should run quite fast. Unfortunately, from fuzz testing and benchmarking, I found that instruction execution required an average of 134 clock cycles. That was unexpectedly slow. Some of it is instruction cache miss. Much of it is derived from promoting all instructions to 28 bit representation and then performing a large number of fancy-pants tests which are typically false. Despite the inefficiency of, for example, 21 bit left shift on ARM, this virtual machine (with 0.7% efficiency on a 700MHz ARM credit card computer) executes double precision floating point addition faster than an 8MHz 68000 executes 16 bit integer addition. However, it is undeniable that my instruction format is the major impediment to this process. I got too smart. I've gone from one extreme to the opposite extreme and neither is efficient.
For my third serious design, I'm now replicating AMD's trick of incorporating 29050 (RISC) into K5 (x86). However, in my case, I'm taking the best parts of my previous designs and incorporating them into 6502. This process retains eight 64 bit registers, 3-address operations, fused multiply-accumulate, data stack, pre-scaled branches and it retains similar instruction density. Losses are minor. This includes arbitrarily deep nesting of exceptions and interrupts (which were tied to SIGUSR1 and SIGUSR2), Fibonacci lengths, post-increment, LEA [Load Effective Address], privileged instructions, general fused accumulate and my peculiar fused compare-and-branch. Gains are significant. This includes the speed of 8 bit instruction decode, zero page and byte-by-byte arbitrary precision arithmetic in binary or decimal. Two accumulators and six asymmetric index/temp registers isn't perfect but it should be sufficient for common cases. Most significantly, core functionality will work with existing compilers, operating systems and applications.
You may be surprised that I have discarded a privilege system. In an arbitrary design, it is possible to meet the Popek and Goldberg virtualization requirements by having a clear division of privileged operations. For example, in 68010, the tweaked version of the 68000 which meets such requirements, the flag register is split into an 8 bit unprivileged section and an 8 bit privileged section. Attempts to read or write the privileged section may be halted or intercepted. In an extended 6502 emulator, there is ultimately a modal dialect of one or more bits. However we never require an explicit privilege bit because:-
- Dialect specifies opcodes.
- Opcodes specify register visibility.
- Dialect plus opcode specifies addressing mode.
- Addressing mode specifies address-space.
Therefore, we have cases which may include:-
- 6502 binary may run in one 64KB segment on a 64KB boundary. Importantly for emulation speed, this requires no bound check.
- 6502 binary may run in multiple 64KB segments. For example, program and data may be split. This requires no bound check.
- 65816 binary may run in one 16MB segment on a 16MB boundary. This requires no bound check.
- 6502 binary with 64 bit RISC extensions may be restricted to 16 bit addressing. This requires no bound check.
- 6502 binary with 64 bit RISC extensions may be restricted to 24 bit addressing. This requires no bound check.
- 6502 binary with 64 bit RISC extensions may be restricted to 32 bit addressing. This requires no bound check.
- 6502 binary with 64 bit RISC extensions may have unrestricted access to memory and I/O. This requires no bound check.
- All shared binaries may be placed in one contiguous, growable, read-only segment. This requires coarse bound check prior to indirect jump.
None this requires an explicit privilege system. Regardless, it is possible to have text or graphical applications which are protected from each other while also having one or more high bandwidth applications, such as a video player. The intention is to have the stability of 8086 Protected Mode and the creative anarchy of Amiga Workbench. If the anarchy becomes unmanageable then it is possible to wrap everything inside a 32 bit, 40 bit or 48 bit address-space. Be warned!
ImplementationThe intention is to make the project affordable and accessible. Therefore, anything more niche than Arduino is not a priority.
- The minimum implementation is software only running from POSIX shell. (I investigated the possibility of implementing a virtual UART to XWindows. Unfortunately, raw XWindows protocol is more difficult than six dialects of 6502.)
- The software desktop implementation should compile without POSIX and run raw on a microcontroller. Initial targets are ARM, AVR and 8051. In particular, WCH's CH330 - the infamous 3¢ microcontroller. It might be heresy to suggest this but 8 bit 6502 can be used to implement 64 bit 6502. I have devised a state machine and arrangement of 74 series chips which allows 11 or 12 pins to access an external memory bank. It was initially devised for AVR to run 6502 in a 16MB address-space or larger. However, it could equally be connected to 6522 or FPGA running as stock 6502 plus 6522. Or used for an entirely different purpose.
- While hunting a bug, I inadvertently wrote enough library to replace Arduino digital I/O. Therefore, my AVR software is 100% Arduino compatible, with or without the official Arduino LGPL library. Furthermore, my version is smaller, faster and MISRA compliant. Anyhow, minimum hardware cost is USD5 for clone Arduino Nano, optional USD7 to program a blank Nano via SPI and USD20 for components.
- I hope to have USB and/or SPI AVR in a 6502 socket with jumpers or variants for 6510 and 65816. I believe it is reasonable to price this below the cost of an official Arduino Nano. I am extremely willing to outsource design and manufacture if I recoup expenditure, plus or minus USD100 or so. For anyone who has an undemanding application - or conversely - for anyone who desires a custom opcode, this may be preferable to genuine 6502 hardware or FPGA.
- In addition to AVR or similar being used in a legacy 8 bit host, identical firmware may be used in the motherboard of a standalone card system. In particular, a One Time Programmable microcontroller with 6502 emulation may be paired with a reflashable boot ROM which cannot be updated from the booted environment.
- C implementation facilitates CUDA implementation.
- C with no forward references facilitates port to VHDL.
Anyhow, at present:-
- 150KB C with all possible warnings enabled produces clean compile on GCC for ARMv6 and Atmel's fork of GCC for AVR.
- Static analysis indicates that nested macros are too long. That tends to happen when specifying a processor.
- Leading parameters on all major functions share name and type. This greatly reduces parameter shuffling. For example, I wish to avoid circumstances where foo(a,b,c) calls bar(a,c). In this case, it is typical to push b on stack, copy c, call bar(a,c), return and pop b. This is worsened by return types which may differ in size.
- Debug build has POSIX printf() statements after executing each instruction.
- Vectors are not implemented and therefore I haven't tied POSIX SIGUSR1 and SIGUSR2 to NMI and IRQ. (From previous work, it is great fun to single-step, invoke a POSIX signal and watch a nested interrupt.)
- All instructions are restartable and therefore all implemented opcodes would correctly handle bus error if it had a working vector.
- I believe that I have defined a case statement for every official opcode of NMOS 6502, CMOS 6502, 65816 and 65CE02. Common unofficial opcodes, such as LAX, are implemented. Block copy is unimplemented.
- Some flags are present on some opcodes. This hasn't been a priority because the legacy instructions may be swapped for another implementation.
- Common opcodes produce sane output. The remainder is not a priority because I'm exploring other functionality.
- Addressing modes are largely untested. In previous designs, I've had numerous errors with stack and therefore this part is tested more thoroughly.
- Macros for SIMD correctly generate rolled or unrolled loops according to compiler parameters. This allows a smaller but slower implementation.
- Extensive use of #ifdef allows 8/16/32/64 bit SIMD to be excluded from build. Likewise for unused opcodes and addressing modes. (64 bit registers may be present but dormant.)
- SIMD is applied wherever it has been tested. Unfortunately, disassembly indicates that automatic SIMD everywhere is too slow. Therefore, I will be changing implementation such that unprefixed legacy opcodes only update the bottom 8 bits of a register. (Or 16 bits, maybe, for 65816.) I hoped to have full symmetry rather than an 8:56 or 8:8:48 bit split in the registers. If my first choice was to use an 8:8:16 bit split then I'd use 68000.
Unfortunately, avr-gcc generates horrendous code. For example, assume 32 bit variables and a reasonable compiler optimization setting. In this case, something akin to (0xFFFF0000L & segment) | (0xFFFF & (reg[2] + abs)) would be expected to generate 16 bit addition and 16 bit move. Unfortunately not. Indeed, the most pressing insight I've gained from disassembling the output of 8 bit C compilers is that every variable and intermediate value should be tagged with min, max, offset and stride. That would be sufficient to identify 16 bit addition and eliminate all logical operations. Instead, I have to restate all similar code in manner which does not introduce endian problems. (I used logical expressions to hide endian problems while allowing optimization of 24 bit masks.)
Interrupt HandlingI'm concerned about interrupt speed. This is unrelated to register count, synchronous block copy or hardware divide (integer or float). Wider vectors will slow interrupt vector read at legacy frequency and data bus width. Assuming data bus width is half the vector width, this is reduced to the familiar two memory read cycles. After that, it falls behind. Following a wide vector implies that execution is not in an 8 bit mode (or 65816 mode). In the trivial case, it should be possible to indirect via 0xFFFC and switch down to a 16 bit or 24 bit address-space; possibly automatically based upon address range. For any data bus width, this requires more main memory reads cycles than 6502. It also requires a dedicated interrupt stack. Otherwise, an attempt to push program counter and interrupt flags onto user stack may incur page fault and initiate a nested interrupt. This only worsens interrupt response time and jitter.
It is possible to make interrupt handling modal. For example, interrupts may default to 65C02 mode. From here, interrupt handling could be modified to the slightly expanded 65816 mode. This is preferable for a legacy system or minimal system because it maintains maximum speed and only requires one ROM within a legacy address-space. However, this arrangement may not provide any mechanism to break out of legacy modes. It may be possible to default to 64 bit vectors from a base address outside of the legacy 64KB/16MB address-space and then statelessly or statefully switch down to legacy vectors (where mappings exist). In some cases, partial decoding of a small ROM would be required to direct interrupts into the legacy address-space. However, we have the opposite case when the bottom 16MB is not mapped to a legacy host. In this case, we lose a maximum of 1/64 of larger address-space. Ordinarily, it would be 16MB out of 4GB. This is 1/256. However, a flexible mapping would be:-
- 0x00000000 to 0x00FFFFFF: Modal registers which are invisible in most modes of operation. This is beneficial for emulation speed and hardware implementation.
- 0x01000000 to 0x0100FFFF: Optional legacy 6502 address-space.
- 0x01000000 to 0x01FFFFFF: Optional legacy 65816 address-space.
- 0x02000000 to 0x03FFFFFF: 64 bit vector base and boot firmware.
- 0x04000000: First suitable address for RAM is 64MB.
- 0x20000000: First unshadowed address suitable for DDR3 RAM may be 512MB or higher.
For systems with 64MB RAM or less, addressing should begin at 0x04000000. In other arrangements, RAM may begin at a small set of alternative addresses or a maximum of 64MB is overshadowed. This is minor compared to x86 where it is typical to lose 3/8 of a 32 bit address-space to PCI mapping and therefore up to 1.5GB RAM may be unavailable to 32 bit software. Regardless, mutually incompatible modes of operation are satisfied with modal registers in a chunk of address-space. This is not ideal but mode register state meets the Popek and Goldberg virtualization requirements without requiring special opcodes. Furthermore, this practice may be nested. For example, it is possible to place one or more instances within a 40 bit address-space on 4GB boundaries.
After placing programs in containers and denying access to I/O, programs will require communication with an operating system and possibly each other. I'd like the simplicity and stability of a program which runs within a 64KB address-space. However, I regularly edit text which is larger than 64KB. These are mutually conflicting requirements. Fortunately, it is possible to satisfy both requirements by providing an in-memory key/value store. In this case, individual lines of text may be stored external to a program. This technique increases memory utilization without incurring long-term memory leak or compromising integrity. It can also be used to selectively share information. For example, to implement clipboard functionality.
This and other functionality requires a fast, portable interface which ideally allows multiple operations per call. This is most likely to require handling of the BRK opcode in a manner which differs by execution address or trapping indirection via 16 bit vectors which also differ by address. It does not help that both mechanisms are already borked. BRK is opcode zero because IRQ was originally implemented by clearing any opcode and sharing the same vector. Call conventions to overcome simultaneous execution of BRK and IRQ require flags and a loop around BRK to ensure that IRQ did not take precedence. 65816 separates BRK and IRQ vectors but the vector addresses are mutually incompatible with Acorn operating system entry points.
Full compatibility in all modes of operation might be attained by providing vectors which are not claimed by Acorn, WDC or other parties. The content of each vector would be its own 16 bit address or a special value. Indirection to such address is error but this is sufficient to invoke context switch. From here, I recommend a minimum of two system calls. The first system call would mediate resources from a kernel, such as memory allocation and yielding processing power. The second system call is for state which is visible beyond the kernel; namely I/O. This includes streams, logging, progress statistics, text console, graphical interface, files and network. For POSIX compatibility, timers would be conflated with file and network access. I further recommend that all parameters are passed on stack. Although, parameters may be little more than the system call number and the address of an array of pointers. It is trivial to specify system call number in Register X. However, should it be prescaled by two? More? And what happens if there is a native SWEET16 mode and we don't have Register X? Or a native stack mode and there are no general registers? Stack parameters are the most portable and - contrary to intuition - fastest for system calls. However, stack alignment problems remain.
Anyhow, at present, I'm mired by interrupt handling. I'd like it to be fast, portable and 100% downward compatible. I also wish to avoid known problems without adding too much code or circuitry.