I am horrified that I am repeating
a pattern which has ongoing for at least
a decade. Specifically, newbie creates account and announces an architecture extension which is never completed or commercialized. This trend was old in 2007. But do we know what happens to these efforts? Is it compiler support? Operating system support? Poor gains? Complexity
of the task? Market acceptance? I was not aware
of all variants given in references but there may be some omissions. Specifically, the MyCPU Project, Gigatron and 65j02 may be
of interest.
Other ProjectsThe MyCPU Project is gorgeous. I wish that I invented it. It is an extensible rack design implemented with logic chips on the smallest 160mm*100mm EuroCards. It is loosely based upon 6502, although not binary compatible. I suspect that flag handling is incompatible. Regardless, there is
a fork
of TinyC and, therefore, for practical purposes, compiler support is marginally behind 6502. The CPU requires home PCB etching
of five cards. Memory, serial, VGA and Ethernet require further cards.
A full system is sufficient for GUI web browsing via VNC client. Due to the arrangement
of the small cards, there is one 8 bit data path through the CPU. This may explain why some instructions require up to 10 clock cycles rather than
a 6502's maximum
of seven clock cycles.
The Gigatron is even more loosely inspired by 6502. Unfortunately, the Gigtron is
a veritable set
of architectural dead-ends including
a Harvard architecture with an 8 bit data segment and read-only 16 bit program segment (
of which 8 bit is for opcode and 8 bit is for operand);
a branch delay slot; and
a completely borked I/O implementation, which is lampshaded with an extension unit called Pluggy McPlugface. The least 6502 feature is the X:Y+ addressing mode which indexes an upper:lower memory address with post increment only on Y. I strongly suspect this addressing mode is used to implement VGA output. The Gigatron omits 6502 stack and interrupts. (The method to implement rastering without interrupts around an interpreter with string handling is left as an exercise.) Most impressively, the single board can be soldered from parts in 2.5 hours, it runs at more than 6 MIPS and it is supplied with
a driving game which is similar to Namco's Pole Position.
I'm not sure about the details
of the 65j02. I am under the impression that Jeri Ellsworth shipped an FPGA implementation which contained
a (presumably) clock cycle accurate Commodore 64 emulation. I'm particularly irked about this project because it occurred to me many years ago that
a legacy games console could be process shrunk until it could be contained within
a joystick and, therefore, the lead from the "joystick" would be something akin to
a UHF modulated NTSC signal. It was particularly galling to learn that 70,000 were sold on one day on QVC, 250,000 were manufactured in one production run and more than 500,000 were manufactured in total. I suspect many people would like to sell
a similar quantity
of 6502s, especially if it is possible to make
a dollar or more on each one. Details are sketchy but I've been told by
a friend that it is possible to expand one
of these "joysticks" into
a full Commodore 64 with cassette deck, multiple 1541 floppy drives, IEEE-448 printer and networking. From here,
a Raspberry Pi can be used to persist 1541 protocol or bridge to
a read-only FTP archive. Potentially, WebDAV allows read/write random access to
a remote filesystem with encryption and authentication. This is less
of a bodge than NFS, SMB or FAT32 block storage over USB, Ethernet or similar.
Most significantly, I've been told that the C64 Direct-to-TV has dormant 32 bit functionality. This is entirely plausible given that Jeri Ellsworth's previous work includes
a multi-system emulator. Can anyone confirm this? Or is it the seed
of a rumor
of a clock cycle accurate 6502 with 32 bit extensions? Also, don't stop when you're winning! Why are there no sequels to this product? Jeri Ellsworth appears to be working on an augmented reality project. Perhaps the low energy requirement
of wearable computing suggests
a hacked-up, 32 bit, inspired-by-6502 design with FPU extensions?
My project appears to be almost
a strict subset
of André Fachat's 65002. I have also seen reference to 65GZ032 but it appears to have been run from
a Yahoo mailing list which fell dormant before it was deleted. I was previously unaware that 65GZ032 was intended to run from
a legacy host, like
a Commodore 64. I am simultaneously encouraged and discouraged that I have vaguely conventional ideas but that my predecessors have not been wildly successful. I was previously unaware
of 65Org32. I was only aware
of 65Org16. Indeed, I laughed outrageously when I discovered 65Org16. You guys know how to have good, clean fun! The most common cause
of processor architecture obsolescence is exhaustion
of the address-space. So just make everything twice as wide! Who cares if that kills binary compatibility - or the ability to read bytes? But hey, you move from an architecture which defined less 2/3
of the instruction-space to an architecture which defines about 0.23%
of the instruction-space. One fork filled this with two nybble fields allowing 16 position shifts and 16 sets
of registers. For example, TXA now has 16 sources and 16 destinations. I'm disappointed that no-one suggested 16 stacks. Regardless, the 16 bit 70MHz FPGA implementation, suitable for VGA display, convinced me that
a softcore 6502 extension is viable as
a GPU.
RequirementsI'm still reading through references. The references are invaluable for gathering requirements.
A subset
of contributors would like to see
a 6502 (or
a extension thereof) decode
a contemporary video format at
a contemporary resolution. Technically, that requirement is
a moving target which gets more difficult every year or so. Nevertheless, it is
a sentiment which I share. I also believe that it is possible to implement
a Java interpreter which outperforms x86, ARM and RISC-V.
I have noted two pairs
of contradictory requirements and I may encounter more. The first pair
of contradictory requirements is fast interrupts and atomic block copy. I suspect this is part
of a larger problem where the usual two choices are synchronous block copy with all
of the fun
of cache churn or asynchronous DMA with all the fun
of queuing and notification. For anyone who disdains
a cache, churn isn't
a consideration. The second pair
of contradictory requirements is parsimonious, composable opcodes and hardware integer divide. (Float is
a separate case.) One
of the features which made ARMv1 notable among
a long line
of forgettable, interchangeable RISC processor architectures was the decision to implement addition, subtraction and multiplication but not division. The rationale was quite simple. Opcodes to perform addition, subtraction and multiplication can be used to perform arithmetic with arbitrary precision. However, this is not the case for hardware divide. Furthermore, division is the least common
of the four operations and division by powers
of two can be reduced to logical operations. In the miscellaneous case, where the least common operator has
a denominator which is not
a power
of two,
a subroutine (typically called _moddiv or similar) performs long division. The integer result and/or the remainder can be used after
a single call. So,
a=x/7; b=x%7 amortizes the work
of one long division.
There are many suggestions for additional opcodes but relatively few suggestions for memory models, privileges or removing features. For example, I'm considering replacement
of the absolute addressing mode with absolute+Z. The reason for this becomes apparent where the memory model is 32 bit (or larger). In this case absolute+Z should be regarded as Z+absolute where
a 32 bit pointer (or larger) is supplemented with
a 16 bit unsigned struct offset. Implementation divides into three cases. When emulating
a legacy 6502 dialect with register Z, absolute addressing mode without Z offset must be honored. When emulating my desired dialect, temporarily clearing X, Y or Z allows X+abs, Y+abs or Z+abs to work as abs. (Likewise for alternate registers I, J and K.) When emulating
a legacy 6502 dialect without register Z, hardware reset and/or executive code should clear Z. The previous two cases are now interchangeable and the most convenient one can be implemented.
Interupt Latency Versus Atomic Operations Versus Virtual Memory And Multi-CoreI don't plan any changes to the core 151 instructions
of NMOS 6502 beyond Z+abs addressing. Other changes, such as longer jumps and branches, two sets
of registers and SIMD, are only invoked with prefixes. However, consideration
of opcodes in extensions, such as INW and DEW (16 bit atomic increment and decrement), led to
a productive chain
of thought. In my desired dialect, there are always 256 short memory references which are accumulator width. In 64 bit mode, this leads to
a zero "page"/base "page"/whatever which is 2KB rather than the customary 256 bytes. In part, this provides compatibility with JVM opcodes. For example, 0x15 ILOAD, 0x16 LLOAD, 0x17 FLOAD, 0x18 DLOAD and 0x19 ALOAD would all map to LDA with base page addressing mode. (Java syntax looks like C++ without pointers but don't be fooled.
A JVM distinctly looks like Forth targeting 6502 and 8086.)
In my desired 6502 dialect, which is derived from 65CE02, opcodes such as INW with base page addressing, are redundant. However, I was alarmed that I would be discarding atomic operations which are not present in the core 6502 instruction set. Indeed, there are no atomic operations in the core instruction set. Instead, there is
a standard idiom, common across many 8 bit processor architectures:-
- Clear maskable interrupts.
- Do anything.
- Set maskable interrupts.
Ah, yes. For any code running outside
of NMI, opcodes CLI and SEI are bookends for an arbitrary, Turing complete, atomic operation. That works for single-core, single-tasking, single-threaded, single-user code and can be stretched to
a few cases outside
of that. In the general case,
a contemporary system requires
a semaphore or mutex which conveys more than one bit
of state without incurring the ABA problem (like x86). POSIX compatibility may also be desirable, although Linux pthreads demonstrate that implementation doesn't have to be particularly conformant.
My first idea to provide upward and downward compatibility is probably
a dud. If it is possible to read or write
a pointer with one instruction then it is more than sufficient to have
a maximum
of 16 instructions between CLI and SEI. Unfortunately, an alpha geek said this limit was idiotic. Indeed, it interacts badly with virtual memory and would probably require
a four bit counter within
a privileged flag register. Hey! I created four modal bits at once!
If we truly don't care about interrupt response time then it is possible to take an idea from the Transputer processor architecture and eliminate the modal interrupt bit. The Transputer architecture has
a few good ideas although most
of them are taken from the Burroughs B6000 architecture.
A fundamental concept in
a Transputer is that
a straight run
of code is implicitly atomic. This dovetails with use
of an internal stack which may only be three elements deep. For
a bottom priority task, preservation
of the internal stack is not guaranteed after
a control flow instruction. This is when hardware task switching occurs. Therefore, correct operation requires the internal stack to be exhausted or contain junk at the end
of each instruction sequence.
There are multiple advantages if we apply this idea to 6502. Firstly, for 6502 emulation, there is
a performance advantage because interrupt polling only occurs during control flow execution. Secondly, code density is improved because all instances
of CLI and SEI can be discarded. Thirdly, we can discard the corresponding modal bit - or at least when using dialects
of 6502 which don't have CLI or SEI. (Don't worry! We've still got the decimal modal bit!)
Fourthly, for 6502 emulation, it aids optimization
of stack. [This explanation is quite long and can be skipped.] Take the example
of a straight run
of code to apply (
a+b)&c to 32 bit values. In this example,
a+b has to be evaluated in ascending byte order but the logical operation can be applied in either order. To minimize program size, memory cycles and therefore clock cycles, intermediate values would typically be held on stack. Therefore, the least significant three bytes
of a+b would typically be pushed on stack as each byte is evaluated. When progressing from the arithmetic operation to the logical operation, the top byte remains in the accumulator. The least significant three bytes are subsequently popped as the logical operation is applied in descending byte order. If no interrupts (and no stack introspection) can be guaranteed during this sequence then intermediate values can be held in
a shallow internal stack. Eight bytes would handle typical cases. This eliminates the memory cycles for the push operations and the corresponding pop operations. This approach still works if we have interrupts and stack introspection. However, if we receive an interrupt signal and then make the external stack consistent, interrupt processing is delayed by
a variable number
of cycles. It is possible to cache to stack pointer boundaries. However, it is more efficient to make the full depth
of the internal stack available at the beginning
of every interrupt level, at the beginning
of every subroutine and at the beginning
of every condition and loop. In common with
a Transputer,
a deferred interrupt dovetails with exhausting an internal stack after
a straight run
of instructions.
Fifthly, we may not have arbitrary, Turing complete, atomic operations but we retain useful operations such as linked list insertion and fixed length block copy (as an unrolled loop). Sixthly, if we count the failure to follow
a conditional branch as
a straight run
of code, we get atomic operations such as test-and-set, compare-and-swap and variable length block copy (as an unrolled loop). Ignoring the significantly worsened interrupt response time, many
of the disadvantages are currently hypothetical. For example, implicit atomic operations interact badly with NUMA, crossbar memory and virtual memory.
Stack Versus RegisterI am
a fan
of stack designs and single accumulator designs. However, it is undeniable that such designs get mired when they are process shrunk and/or memory grows. Why does this occur? In my experiments with processor design, I've found that there is
a critical path running from accumulator to ALU and back to accumulator which is rarely less than 14 logic gates. 6502 meets this 14 gate criteria. 65816 does not meet this 14 gate criteria. Nor does Z80 or x86. RISC designs typically fail to meet the 14 logic gate criteria although this is not due to register width. One
of the most obvious failings
of a symmetric register file is that 2^R registers adds R logic gates to the critical path. In mitigation, the same principle applies to addressable memory. 1KB has 10 gates in the critical path. 1MB has 20 gates in the critical path. 1GB has 30 gates in the critical path. 1TB has 40 gates in the critical path. Process shrink makes this worse because it increases the relative latency - and
a cache can only improve average case latency at the expense
of worst case latency.
An absurd hypothetical example would be to place
a 6502 directly adjacent to 1GB RAM. If these two units are interfaced with
a trivial bank switching scheme then it is possible to take advantage
of static column paging and pre-decode all address bits except the least significant 16. In this case, the latency
of the ALU and memory are approximately balanced. However, if the two units are interfaced with
a Commodore style bank switching scheme (where program and data accesses are automatically directed to separate banks) then all address bits may change with each access. In this case, the processor would have to be slowed to match the latency
of the memory. It is possible to provide two sets
of decode via dual port RAM. However, due to economies
of scale, it is more cost effective to provide separate banks for program and data. This is commonly known as
a Harvard architecture and has numerous disadvantages.
There is
a qualitative difference when moving from 1KB to 1MB to 1GB to 1TB. One
of these differences is that we can be more relaxed about the critical path
of logic gates. This manifests as wider registers, more numerous registers and instructions such as multiplication. Obviously, this leads into Amdahl's law
of diminishing returns. But don't act like 8 bit computing is the paragon
of efficiency. Efficiency per transistor has been declining since Intel released the 4004 processor architecture. ("Four bits is enough to hold
a decimal digit and much more besides! But you profligate whippersnappers use
a whole eight bits when you only need one or two! If you keep wasting memory like this, I'll have to buy
a whole 1024 bits!")
Prefix Instruction Versus Mode BitMore seriously,
a consideration
of Amdahl's law has softened my stance against the 65816's modal bits. My first impression was that an extra two bits
of state when entering and exiting subroutines added
a large amount
of cognitive load. It hadn't occurred to me that an application may be written in
a style where there is
a default setting. For example, the accumulator may default to 16 bit mode and exceptions may be bookended with mode flips. Alternatively, an application may default to don't care and may be explicitly set before each use. That style
of use is
a net positive. In particular, it aids brevity without sacrificing throughput. However, several misgivings remain. In particular, some
of the instruction density is gained through ambiguous encoding. People have exploited the differences between x86_32 and x86_64 to evade malware filters. I'm working on 6502_64 so that we can sidestep this type
of technical debt.
An unambiguous encoding necessarily has
a lower instruction density. While I also have misgivings about prefix instructions, there is an advantage to
a stateless encoding where the size
of the prefix plus opcode never exceeds the size
of the data. In the case where accumulator width matches the width
of the external data bus, an implementation may read two or more instructions concurrently. In the most optimistic case, where alignment is ideal, one
of these is
a prefix and one
of these is an opcode. Essentially, 8 bit data should handled with 8 bit opcodes. Larger data should be handled with larger instructions. This is downward compatible to systems with the most narrow data bus. It also provides an advantage when handling small pieces
of data. For example, the latency
of 8 bit addition is less than the latency
of 64 bit multiplication. Therefore, there may be an advantage to fit more 8 bit addition instructions into
a hypothetical cache line.
An individual mode bit is great because it allows two contradictory semantics. However, mode bits tend to hang about in gangs. This leads to unbounded complexity. I believe that Edson de Castro's objection to mode bits was the cost
of developing, testing and distributing microcode which would be bank switched by one or more mode bit inputs. Invariably, one mode will be used more frequently than another. However, making this stateful is only an advantage if one mode is the most common. For example, when using
a 65816, it is an advantage to default to 16 bit operation. However, when handling 8 bit data, 16 bit data, 32 bit data, 64 bit data and possibly larger pieces
of data, it is conceivable that no data size is used for the majority
of operations. Effectively, there is no good default. Also, starting from 65816, which defines 254 opcodes in detail and the remaining two in outline, how would I specify 32 bit operation? Conceivably, I could define
a small subset
of the WDM prefix specify 8/16, 32/64, 128/256 or 512/1024 operation and then use the existing mode bits to specify finer detail. Or perhaps I could specify accumulator and/or index size concurrently and use the legacy short encodings to switch down to shorter sizes?
Actually, I've considered multiple definitions for the 65816's WDM prefix; some whimsical and some facetious. The best encoding that I've devised so far is for the WDM prefix to precede two 12 bit ARM Thumb instructions but redefine the Thumb 8/32 bit operations as 32/64 bit operations. This would allow short instructions which operate on the traditional register set and default to
a stateful 8/16 bit encoding. It also allows longer instructions to statelessly specify 32/64 bit encoding and operate on an extended set
of registers. It also retains 65C02 and 65816 binary compatibility while approaching the instruction density
of Thumb.
32/64 Bit 6502 With Binary CompatibilityThere is suggestion that
a 32 bit extension
of 6502 does not have to be binary compatible. There is also suggestion that it is called ARMv1. This is the problem with 6502 architecture extension. It either has to extend/emulate/ignore the established 16 bit extension or it is yet another 32/64 bit stack/RISC chip - and competition is steep among RISC chips. I reversed an arbitrary design into 6502 and from that exercise, I've found that it is also possible to reverse various extensions
of ARM and RISC-V into various extensions
of 6502. The trick is to find
a mapping
of registers and opcodes. This is particularly easy if it is possible to define
a hybrid instruction stream. The unusual history
of 6502 made
a notably sparse instruction set. I am unable to name any other general purpose processor architecture which did not define less than 2/3
of the instruction-space in the first iteration. Likewise, I am unable to name any other general purpose processor architecture which does not have
a contiguous, undefined 1/4
of the instruction-space. This makes 6502 architecture extension particularly easy in
a manner which does not apply elsewhere. Likewise, the short and long encodings
of ARM and RISC-V provide encodings which do not exceed 1/4
of the instruction-space.
By default, ARM has four byte instructions. From ARMv1 to ARMv7, four bits
of every instruction are
a conditional execution field. These conditions are defined in pairs, such zero/not zero. One pair is always/never. "Never" is used for
a 12 bit short encoding called Thumb. With some bit shuffling this can be mapped into 1/16
of the 6502's instruction-space. With the benefit
of hindsight, RISC-V uses
a two bit field to specify one 30 bit encoding or three 14 bit encodings. Therefore, RISC-V has 12 times as many short encodings compared to ARM Thumb. Regardless, RISC-V's long encoding can be mapped into
a contiguous 1/4
of a 6502's instruction-space. Specifically, it fits into Group 3, Group 7, Group B and Group F. Using
a hypothetical two modal bits and
a modicum
of bit shuffling, it is possible to execute an instruction stream which consists
of:-
- 6502 instructions only.
- 6502 instructions and ARM short instructions.
- 6502 instructions and RISC-V long instructions.
- ARM short instructions and RISC-V long instructions.
Without mode bits, it is also possible to execute 65816 instructions and pairs
of ARM short instructions. I don't recommend any
of this an
a practical exercise. However, if two or three hardware definitions are spliced together and an assembler is similarly spliced, the unholy amalgamation will execute the output
of GCC, LLVM and TinyC. I mention this for multiple reasons. Primarily, it definitively answers
a few questions. Yes, it is trivial to make
a 32 bit extension
of 6502. In particular, it is possible to have 6502 binary compatibility within ARM. Arguably, it has always been possible to maintain 6502 binary compatibility within ARM. No, it would not have been commercially successful.
There are further reasons to mention
a 6502/ARM/RISC-V hybrid. If you cannot exceed this standard then don't bother. Like the video decode problem, the acceptable threshold increases every year or so. However, the quest for single threaded performance has led to designs where an increasing cluster
of units run autonomously within their own light-radius. This includes multiple integer units, FPU, MMU, SIMD, crypto and GPU. Most ridiculously, this leads to duplicate circuitry and shadow registers to model the behavior
of circuits running within the same CPU. I've found that it is possible to unify most
of this cluster into one unit. The trick is to treat each unit as
a separate architecture and find
a mapping
of registers and opcodes. In this case, the primary constraint is register width.
Maximum Number Of RegistersI hope that I've given
a sound theoretical basis for
a 32 bit processor to have multiple symmetric registers. There is also
a theoretical basis for an upper bound which never exceeds eight registers. This is derived from the graph coloring problem. Technically, the British spelling should be used because it was first applied to the geographical divisions
of England. However, I'm not
a pedant.
There are numerous mathematical problems which have been unresolved for centuries. For example, Fermat's Last Theorem and the Canonball Conjecture (both solved by Andrew Wiles). Well, the graph coloring problem was solved in 1976 and published in 1977. That's incredibly poor timing because it occurred during the development
of 6502 and Z80. It was also quite controversial because it was
a computer assisted proof. The graph coloring proof states that contiguous, two dimensional regions on
a plane never require more than four colors for them to be distinct. This is very, very important for programming because we never have to handle more than four intermediate values. How does that work? All unstructured programs can be implemented as block structured programs. In the worst case, unstructured bytecode can be executed with
a block structured interpreter. All block structured programs can be drawn as
a state
machine on
a two dimensional plane without any arrows crossing. (Calls between functions can be implemented with
a dispatcher in the shape
of a binary tree and all types on stack are fungible.) Given Appel, Kenneth; Haken, Wolfgang (1977),
Every Planar Map is Four Colorable. I. Discharging, Illinois Journal
of Mathematics,
a state
machine on
a two dimensional plane without crossing arrows is four colorable.
People argue about the merits
of stack, single accumulator or more. Here we have
a definitive answer.
A statement written in infix notation with precedence rules, such as l=x*x+y*y, can always be reduced to stack operations without ambiguity. Furthermore, where the precision
of a value exceeds the precision
of an accumulator, the same stack can be used to hold intermediate chunks
of data. So, one accumulator, one index register and one stack are sufficient to process one statement. Unfortunately, this is only one node
of the state
machine. An algorithm with multiple statements may require juggling four or more intermediate values. This can always be reduced to
a maximum
of four intermediate values. However, for brevity and clarity, or due to implementation quirks, it is not necessary to strictly adhere to this limit.
People disagree over this less frequent but essential case. Multiple registers and stack introspection both allow juggling
of four or more intermediate values. Unfortunately, the threshold where each technique is most efficient may not be immediately apparent. For example, it may depend upon the efficiency
of a given compiler and given algorithm, connectivity
of register transfers, the width
of the external bus, cycle time
of main memory or the bit patterns
of the opcodes used to select functions in
a given ALU implementation. However, if addressing is 32 bit or larger then it is overwhelming likely that multiple registers are beneficial.
Even if you completely disregard the graph coloring proof and believe that five is the minimum, exceptions would be sufficiently rare to make stack operations preferable to the inefficiency
of doubling registers. Unfortunately, there are some practicalities around indexing and stacks. Arbitrary data is sourced from main memory and modifies one or more accumulators. If these operations are conceptually 3-address operations applied to four accumulators then it is beneficial to provide an additional two registers. These registers would be general index registers into
a common data segment or stack pointers into separate conceptual segments. However, the result may not be practical due to poor instruction density. Instruction density can be increased by adding deliberate asymmetry. Asymmetry may arise from 8086 source compatibility with 8080. For brevity
of specification, correctness and throughput, binary compatibility with 6502 may be preferable. Asymmetry may require stack operations or additional registers to shuffle data. This requires seven or more general registers. Fortunately, register pressure can be reduced by not making stack pointers or program counter part
of the general register file.
Stack pointers can be incremented and decremented without transfer to accumulator. Likewise, intermediate addresses can be computed in an accumulator and then transferred unidirectionally to an index register. Overall,
a 3-address
machine reaches an efficiency threshold when it has four accumulators, two index/temp registers and one or more stacks.
A system with less registers may allow them to be used in multiple modes. For example,
a 6502 index register is often used as
a temporary register or stack pointer. This intentional conflation reduces cognitive burden, allows
a parsimony
of implementation and improves instruction density. Unfortunately, there are cases, which may occur once per statement or less frequently, in which this is
a false economy. This is particularly true when considering transistor count
of a processor and memory.
The ability to handle four or more intermediate values explains much
of the lunacy in processor design. Notable examples include the 16 element stack window
of TMS9900, register windows in SPARC, the conveyor belt concept
of the Mill processor architecture and Intel's Itanium which had more than 2KB
of state to context switch. It also explains why processor architectures take multiple positions. For example, the general register count in AVR and RISC-V may differ by
a factor
of two. Most significantly, ARM Thumb restricts access to eight general registers. There are many reasons for specifying too many registers. The most obvious case is
a surplus
of bits; typically to achieve 8 bit, 16 bit or 32 bit alignment. ("Hey! We've got
a spare bit! Let's have two accumulators and two index registers!") This is why general purpose processor architectures always specify 1/2
of the instruction-space and typically specify more than 2/3.
A confounding factor was MIL-STD-1750A which specified
a processor with 16 registers. RCA1802 has 16 registers. MC68000 has 16±1 registers and
a version was developed with IBM System 370 microcode, which also has 16 registers. SWEET16 has 16 registers where special registers, such as program counter and stack, backfill from R15. Many
of the RISC designs influenced directly and indirectly by SWEET16 also have 16 registers and backfill from R15. This includes ARM.
If you want
a futile exercise, try designing
a symmetric, 3-address, 16 register processor architecture which uses 8 bit opcodes. It can be achieved by splitting the opcode into two or three fields and heavily prefixing the instructions. This includes defaulting to MOV and then prefixing any desired ALU operation. Understandably, the instruction density is terrible, in part due to the combinatorial explosion
of prefix instructions. So, how have systems with 8 bit opcodes, such as x86, been sold as MIL-STD-1750A? Easy. Ignore the specification or get
a waiver. Indeed. My results with processor design advanced considerably after I ignored MIL-STD-1750A.
My Processor ArchitectureMy first serious design was inspired by 6502, 68000, 88000 and some work with
a quadtree video codec similar to HEVC. Specifically, the 2:14 split
of STA:ALU operations in 6502 opcode Group 6 and Group D (ORA, AND, EOR, ADC, *STA*, LDA, CMP, SBC), 68000 symmetrical 2-address encoding and quadtree motion delta encoding inspired
a 16 bit instruction format in which four nybbles may represent opcode or operand. In the preferred implementation, hexadecimal digits 0-B represent operand and hexadecimal digits C-F represent opcode. As an example, 0xF012, 0x0F12, 0x01F2 and 0x012F are distinct instructions which operate on R0, R1 and R2. In addition to aiding debug, this arrangement provides
a practical quantity
of 0-, 1-, 2- and 3-address instructions. It also provides
a four operand, zero opcode encoding. I was quite pleased with this fusion
of ideas. Despite being complete vaporware, it served as
a basis to investigate numerous concepts including allocation
of instructions in proportion to frequency
of execution, constants in the instruction stream, illegal instructions, unaligned memory access using two separate instructions, multiple data stacks, stacks aligned with register width, biased stack pops, biased branches, atomic operations, SIMD, fused multiply-accumulate and fused compare-and-branch for the elimination
of all user flags. (This was in ignorance
of the MIPS technique
of evaluating logical predicates.) I found that I could subdivide the fields into two inputs and two outputs (although it is most convenient for the fused compare-and-branch to use two "outputs"). In my enthusiasm, I would often define two or more functions to the same bit pattern. However, I have developed techniques to prevent duplicate allocation. The design was intended to minimize multiplexing and the latency
of TTL loads. Indeed, I hoped to make
a scalable 16/32/64 bit design using one, two or four identical boards with 3*32 pin connectors. (This is why I have such admiration for the MyCPU project with its five board CPU and 3*32 pin connectors.)
The ultimate failure
of this design was poor instruction density. The pure load/store architecture allowed commutative operations to be reduced. This applies to ALU operations and conditional branches. However, I was unable to retain
a balanced design and specify more than 6 bits
of constant per 16 bit instruction. This is worse than the notoriously poor constants handling
of a Transputer. Simultaneously, the mapping
of opcodes from 6502 and 68000 was unfavorable.
A further complication was the unconventional 4-address instruction which consumed 81/256 (31.6%)
of the 16 bit instruction-space. Unfortunately, I could not find an operation which justified such allocation. I considered multiplication with 64 bit input and 128 bit output (with possible modal bits for signed/unsigned operation). I considered multiplication
of the form R0=R1*R2+R3. I considered double MOV, line draw and bit rotation. Assuming such problems were solved, my vaporware would require five or more data paths to 64 bit registers. Whereas, Xtensa only requires three data paths to 512 bit registers. It is
a terrible state
of affairs when
a person's vaporware is inferior to
a shipping product.
My second serious design attempted to correct previous failings and also outperform ARM. I thought about this problem for too many years. One
of the problems with processor design is that choices are relatively unconstrained. No individual choice is bad or fatal but if you say yes to everything then the result is an Intel Itanium. Never go full Itanium. Anyhow, the most significant change to my design was
a more conventional arrangement
of instruction fields encapsulated within
a variable length instruction encoding. An instruction-space should be regarded as
a binary fraction rather than
a finite block. Eventually, I discovered that it is possible to specify 7 bit, 14 bit, 21 bit, 28 bit or larger instructions such that:-
- A maximum of 1/2 of the instruction-space may be specified for move instructions.
- A maximum of 1/4 of the instruction-space may be specified for jump and branch instructions.
- A maximum of 1/128 of the instruction-space may be specified for privileged instructions.
- One byte instructions may be 8 bit, single or dual accumulator operations without post-increment.
- Two byte instructions may be 8/16 bit, register file operations with post-increment.
- Three byte instructions may be 8/16/32/64 bit, integer/float, 3-address operations with post-increment and fused accumulate.
- Five byte instructions may be 128 bit operations or larger.
- Longer instructions may use Fibonacci lengths and therefore six byte instructions may be disallowed.
- Commutative operations may share opcodes.
- All relevent instructions may have automatic SIMD.
- Control flow and privileged instructions may be one contiguous group of opcodes. This provides more options for scaled-out implementations.
- Opcodes may be otherwise patterned to facilitate software compression, FPGA implementation or parallel simulation on GPU.
- A breakpoint instruction may scale to the size of any other instruction.
This looks competitive with 8 bit instruction sets, 16 bit instruction sets and RISC; in particular ARM, RISC-V and Xtensa. I implemented this is C and ran it on
a bottom-
of-the-range Raspberry Pi (700MHz Broadcom ARMv6). Results were mixed.
A design loosely inspired by 6502 and ARM running on ARM should run quite fast. Unfortunately, from fuzz testing and benchmarking, I found that instruction execution required an average
of 134 clock cycles. That was unexpectedly slow. Some
of it is instruction cache miss. Much
of it is derived from promoting all instructions to 28 bit representation and then performing
a large number
of fancy-pants tests which are typically false. Despite the inefficiency
of, for example, 21 bit left shift on ARM, this virtual
machine (with 0.7% efficiency on
a 700MHz ARM credit card computer) executes double precision floating point addition faster than an 8MHz 68000 executes 16 bit integer addition. However, it is undeniable that my instruction format is the major impediment to this process. I got too smart. I've gone from one extreme to the opposite extreme and neither is efficient.
For my third serious design, I'm now replicating AMD's trick
of incorporating 29050 (RISC) into K5 (x86). However, in my case, I'm taking the best parts
of my previous designs and incorporating them into 6502. This process retains eight 64 bit registers, 3-address operations, fused multiply-accumulate, data stack, pre-scaled branches and it retains similar instruction density. Losses are minor. This includes arbitrarily deep nesting
of exceptions and interrupts (which were tied to SIGUSR1 and SIGUSR2), Fibonacci lengths, post-increment, LEA [Load Effective Address], privileged instructions, general fused accumulate and my peculiar fused compare-and-branch. Gains are significant. This includes the speed
of 8 bit instruction decode, zero page and byte-by-byte arbitrary precision arithmetic in binary or decimal. Two accumulators and six asymmetric index/temp registers isn't perfect but it should be sufficient for common cases. Most significantly, core functionality will work with existing compilers, operating systems and applications.
You may be surprised that I have discarded
a privilege system. In an arbitrary design, it is possible to meet the Popek and Goldberg virtualization requirements by having
a clear division
of privileged operations. For example, in 68010, the tweaked version
of the 68000 which meets such requirements, the flag register is split into an 8 bit unprivileged section and an 8 bit privileged section. Attempts to read or write the privileged section may be halted or intercepted. In an extended 6502 emulator, there is ultimately
a modal dialect
of one or more bits. However we never require an explicit privilege bit because:-
- Dialect specifies opcodes.
- Opcodes specify register visibility.
- Dialect plus opcode specifies addressing mode.
- Addressing mode specifies address-space.
Therefore, we have cases which may include:-
- 6502 binary may run in one 64KB segment on a 64KB boundary. Importantly for emulation speed, this requires no bound check.
- 6502 binary may run in multiple 64KB segments. For example, program and data may be split. This requires no bound check.
- 65816 binary may run in one 16MB segment on a 16MB boundary. This requires no bound check.
- 6502 binary with 64 bit RISC extensions may be restricted to 16 bit addressing. This requires no bound check.
- 6502 binary with 64 bit RISC extensions may be restricted to 24 bit addressing. This requires no bound check.
- 6502 binary with 64 bit RISC extensions may be restricted to 32 bit addressing. This requires no bound check.
- 6502 binary with 64 bit RISC extensions may have unrestricted access to memory and I/O. This requires no bound check.
- All shared binaries may be placed in one contiguous, growable, read-only segment. This requires coarse bound check prior to indirect jump.
None this requires an explicit privilege system. Regardless, it is possible to have text or graphical applications which are protected from each other while also having one or more high bandwidth applications, such as
a video player. The intention is to have the stability
of 8086 Protected Mode and the creative anarchy
of Amiga Workbench. If the anarchy becomes unmanageable then it is possible to wrap everything inside
a 32 bit, 40 bit or 48 bit address-space. Be warned!
ImplementationThe intention is to make the project affordable and accessible. Therefore, anything more niche than Arduino is not
a priority.
- The minimum implementation is software only running from POSIX shell. (I investigated the possibility of implementing a virtual UART to XWindows. Unfortunately, raw XWindows protocol is more difficult than six dialects of 6502.)
- The software desktop implementation should compile without POSIX and run raw on a microcontroller. Initial targets are ARM, AVR and 8051. In particular, WCH's CH330 - the infamous 3¢ microcontroller. It might be heresy to suggest this but 8 bit 6502 can be used to implement 64 bit 6502. I have devised a state machine and arrangement of 74 series chips which allows 11 or 12 pins to access an external memory bank. It was initially devised for AVR to run 6502 in a 16MB address-space or larger. However, it could equally be connected to 6522 or FPGA running as stock 6502 plus 6522. Or used for an entirely different purpose.
- While hunting a bug, I inadvertently wrote enough library to replace Arduino digital I/O. Therefore, my AVR software is 100% Arduino compatible, with or without the official Arduino LGPL library. Furthermore, my version is smaller, faster and MISRA compliant. Anyhow, minimum hardware cost is USD5 for clone Arduino Nano, optional USD7 to program a blank Nano via SPI and USD20 for components.
- I hope to have USB and/or SPI AVR in a 6502 socket with jumpers or variants for 6510 and 65816. I believe it is reasonable to price this below the cost of an official Arduino Nano. I am extremely willing to outsource design and manufacture if I recoup expenditure, plus or minus USD100 or so. For anyone who has an undemanding application - or conversely - for anyone who desires a custom opcode, this may be preferable to genuine 6502 hardware or FPGA.
- In addition to AVR or similar being used in a legacy 8 bit host, identical firmware may be used in the motherboard of a standalone card system. In particular, a One Time Programmable microcontroller with 6502 emulation may be paired with a reflashable boot ROM which cannot be updated from the booted environment.
- C implementation facilitates CUDA implementation.
- C with no forward references facilitates port to VHDL.
Anyhow, at present:-
- 150KB C with all possible warnings enabled produces clean compile on GCC for ARMv6 and Atmel's fork of GCC for AVR.
- Static analysis indicates that nested macros are too long. That tends to happen when specifying a processor.
- Leading parameters on all major functions share name and type. This greatly reduces parameter shuffling. For example, I wish to avoid circumstances where foo(a,b,c) calls bar(a,c). In this case, it is typical to push b on stack, copy c, call bar(a,c), return and pop b. This is worsened by return types which may differ in size.
- Debug build has POSIX printf() statements after executing each instruction.
- Vectors are not implemented and therefore I haven't tied POSIX SIGUSR1 and SIGUSR2 to NMI and IRQ. (From previous work, it is great fun to single-step, invoke a POSIX signal and watch a nested interrupt.)
- All instructions are restartable and therefore all implemented opcodes would correctly handle bus error if it had a working vector.
- I believe that I have defined a case statement for every official opcode of NMOS 6502, CMOS 6502, 65816 and 65CE02. Common unofficial opcodes, such as LAX, are implemented. Block copy is unimplemented.
- Some flags are present on some opcodes. This hasn't been a priority because the legacy instructions may be swapped for another implementation.
- Common opcodes produce sane output. The remainder is not a priority because I'm exploring other functionality.
- Addressing modes are largely untested. In previous designs, I've had numerous errors with stack and therefore this part is tested more thoroughly.
- Macros for SIMD correctly generate rolled or unrolled loops according to compiler parameters. This allows a smaller but slower implementation.
- Extensive use of #ifdef allows 8/16/32/64 bit SIMD to be excluded from build. Likewise for unused opcodes and addressing modes. (64 bit registers may be present but dormant.)
- SIMD is applied wherever it has been tested. Unfortunately, disassembly indicates that automatic SIMD everywhere is too slow. Therefore, I will be changing implementation such that unprefixed legacy opcodes only update the bottom 8 bits of a register. (Or 16 bits, maybe, for 65816.) I hoped to have full symmetry rather than an 8:56 or 8:8:48 bit split in the registers. If my first choice was to use an 8:8:16 bit split then I'd use 68000.
Unfortunately, avr-gcc generates horrendous code. For example, assume 32 bit variables and
a reasonable compiler optimization setting. In this case, something akin to (0xFFFF0000L & segment) | (0xFFFF & (reg[2] + abs)) would be expected to generate 16 bit addition and 16 bit move. Unfortunately not. Indeed, the most pressing insight I've gained from disassembling the output
of 8 bit C compilers is that every variable and intermediate value should be tagged with min, max, offset and stride. That would be sufficient to identify 16 bit addition and eliminate all logical operations. Instead, I have to restate all similar code in manner which does not introduce endian problems. (I used logical expressions to hide endian problems while allowing optimization
of 24 bit masks.)
Interrupt HandlingI'm concerned about interrupt speed. This is unrelated to register count, synchronous block copy or hardware divide (integer or float). Wider vectors will slow interrupt vector read at legacy frequency and data bus width. Assuming data bus width is half the vector width, this is reduced to the familiar two memory read cycles. After that, it falls behind. Following
a wide vector implies that execution is not in an 8 bit mode (or 65816 mode). In the trivial case, it should be possible to indirect via 0xFFFC and switch down to
a 16 bit or 24 bit address-space; possibly automatically based upon address range. For any data bus width, this requires more main memory reads cycles than 6502. It also requires
a dedicated interrupt stack. Otherwise, an attempt to push program counter and interrupt flags onto user stack may incur page fault and initiate
a nested interrupt. This only worsens interrupt response time and jitter.
It is possible to make interrupt handling modal. For example, interrupts may default to 65C02 mode. From here, interrupt handling could be modified to the slightly expanded 65816 mode. This is preferable for
a legacy system or minimal system because it maintains maximum speed and only requires one ROM within
a legacy address-space. However, this arrangement may not provide any mechanism to break out
of legacy modes. It may be possible to default to 64 bit vectors from
a base address outside
of the legacy 64KB/16MB address-space and then statelessly or statefully switch down to legacy vectors (where mappings exist). In some cases, partial decoding
of a small ROM would be required to direct interrupts into the legacy address-space. However, we have the opposite case when the bottom 16MB is not mapped to
a legacy host. In this case, we lose
a maximum
of 1/64
of larger address-space. Ordinarily, it would be 16MB out
of 4GB. This is 1/256. However,
a flexible mapping would be:-
- 0x00000000 to 0x00FFFFFF: Modal registers which are invisible in most modes of operation. This is beneficial for emulation speed and hardware implementation.
- 0x01000000 to 0x0100FFFF: Optional legacy 6502 address-space.
- 0x01000000 to 0x01FFFFFF: Optional legacy 65816 address-space.
- 0x02000000 to 0x03FFFFFF: 64 bit vector base and boot firmware.
- 0x04000000: First suitable address for RAM is 64MB.
- 0x20000000: First unshadowed address suitable for DDR3 RAM may be 512MB or higher.
For systems with 64MB RAM or less, addressing should begin at 0x04000000. In other arrangements, RAM may begin at
a small set
of alternative addresses or
a maximum
of 64MB is overshadowed. This is minor compared to x86 where it is typical to lose 3/8
of a 32 bit address-space to PCI mapping and therefore up to 1.5GB RAM may be unavailable to 32 bit software. Regardless, mutually incompatible modes
of operation are satisfied with modal registers in
a chunk
of address-space. This is not ideal but mode register state meets the Popek and Goldberg virtualization requirements without requiring special opcodes. Furthermore, this practice may be nested. For example, it is possible to place one or more instances within
a 40 bit address-space on 4GB boundaries.
After placing programs in containers and denying access to I/O, programs will require communication with an operating system and possibly each other. I'd like the simplicity and stability
of a program which runs within
a 64KB address-space. However, I regularly edit text which is larger than 64KB. These are mutually conflicting requirements. Fortunately, it is possible to satisfy both requirements by providing an in-memory key/value store. In this case, individual lines
of text may be stored external to
a program. This technique increases memory utilization without incurring long-term memory leak or compromising integrity. It can also be used to selectively share information. For example, to implement clipboard functionality.
This and other functionality requires
a fast, portable interface which ideally allows multiple operations per call. This is most likely to require handling
of the BRK opcode in
a manner which differs by execution address or trapping indirection via 16 bit vectors which also differ by address. It does not help that both mechanisms are already borked. BRK is opcode zero because IRQ was originally implemented by clearing any opcode and sharing the same vector. Call conventions to overcome simultaneous execution
of BRK and IRQ require flags and
a loop around BRK to ensure that IRQ did not take precedence. 65816 separates BRK and IRQ vectors but the vector addresses are mutually incompatible with Acorn operating system entry points.
Full compatibility in all modes
of operation might be attained by providing vectors which are not claimed by Acorn, WDC or other parties. The content
of each vector would be its own 16 bit address or
a special value. Indirection to such address is error but this is sufficient to invoke context switch. From here, I recommend
a minimum
of two system calls. The first system call would mediate resources from
a kernel, such as memory allocation and yielding processing power. The second system call is for state which is visible beyond the kernel; namely I/O. This includes streams, logging, progress statistics, text console, graphical interface, files and network. For POSIX compatibility, timers would be conflated with file and network access. I further recommend that all parameters are passed on stack. Although, parameters may be little more than the system call number and the address
of an array
of pointers. It is trivial to specify system call number in Register X. However, should it be prescaled by two? More? And what happens if there is
a native SWEET16 mode and we don't have Register X? Or
a native stack mode and there are no general registers? Stack parameters are the most portable and - contrary to intuition - fastest for system calls. However, stack alignment problems remain.
Anyhow, at present, I'm mired by interrupt handling. I'd like it to be fast, portable and 100% downward compatible. I also wish to avoid known problems without adding too much code or circuitry.