6502.org Forum
All times are UTC




PostPosted: Sun Apr 11, 2021 1:34 pm 
Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
I've been inspired by proponents of low cost and accessible 6502 implementations/extensions on AVR/Arduino (2m5, MicroCoreLabs, randyhyde), a collective ambivalence to AVR/ARM/CPLD/FPGA and a commitment to no cost, software only implementation. Given this and Wirth's law, it occurred to me that a 6502 on 6502 simulator would be useful as the basis of processor architecture extension. Running an undifferentiated 6502 binary under simulation is normally pointless and likely to incur a 30:1 performance penalty. However, extensions thereof may be specified concisely and unambiguously in 6502 assembly language. Furthermore, such extensions may be integrated into the workflow of your choice and deployed in the environment of your choice. Assuming that you don't care about performance and aren't using the full address-space, the incremental cost of 6502 on 6502 simulation is close to zero. Outside of this case, the technique is extremely portable and I encourage multiple implementations to be made available under distinct software licenses.

I found it beneficial to implement a 6502 on 6502 simulator using a template system of my own devising. It consists of a spreadsheet and a script which converts a very specific arrangement of CSV data to 6502 assembly. I admit that this meta solution adds complexity. Thankfully, the complexity is bounded. I am able to unambiguously define one opcode per line and the main loop of the templating script fits on screen. This is sufficient to define approximately 2/3 of the base opcodes. Approximately 50 opcodes are defined manually, of which flow control dominates. For example, the off-by-one bug of JSR/RTS (JSR pushes the return address minus one and RTS compensates) is faithfully implemented.
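
The actual spreadsheet layout and script are not shown in this post, so the following is only a minimal sketch of the general idea: one CSV row per opcode, expanded through a text template into a 6502 handler. The column names, the labels `vm_pc`, `vm_a`, `vm_tmp` and `vm_next2`, and the restriction to immediate mode are all assumptions made for illustration. The PLP/PHP wrapping mirrors the flag handling described in this post.

```python
import csv
import io

# Hypothetical CSV layout (an assumption): opcode in hex, mnemonic, addressing mode.
OPCODE_CSV = """opcode,mnemonic,mode
A9,LDA,imm
09,ORA,imm
49,EOR,imm
"""

# Hypothetical handler template. vm_pc is assumed to be a zero page pointer to
# the simulated program counter; vm_next2 is assumed to advance it by two bytes
# and dispatch the next opcode.
TEMPLATE = """op_{opcode}:                  ; {mnemonic} {mode}
        ldy #1
        lda (vm_pc),y       ; fetch the immediate operand
        sta vm_tmp
        lda vm_a            ; simulated accumulator
        plp                 ; bring the simulated flags into place
        {native} vm_tmp     ; the critical step
        php                 ; put the flags back on the stack
        sta vm_a            ; STA does not disturb the saved flags
        jmp vm_next2
"""


def generate(csv_text):
    """Expand every immediate-mode CSV row into one assembly handler."""
    handlers = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["mode"] != "imm":
            continue  # this sketch only knows one addressing mode
        handlers.append(
            TEMPLATE.format(
                opcode=row["opcode"],
                mnemonic=row["mnemonic"],
                mode=row["mode"],
                native=row["mnemonic"].lower(),
            )
        )
    return "\n".join(handlers)
```

Calling `print(generate(OPCODE_CSV))` emits three handlers which differ only in label, comment and the one native instruction in the middle. That is the property which keeps one opcode per line of the spreadsheet.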

I considered preserving X or Y between bytecode dispatches. However, the advantage is marginal and likely to be detrimental if the architecture is extended heavily. Regardless, preservation of flags is awkward. I found that it is most convenient to hold flags in a shadow location on the top of the stack. I originally devised this technique when considering a 6502 on Z80 simulator but it applies with minor adaptation to 6502 on 6502 simulation. Unfortunately, the shadow stack location introduces additional off-by-one cases, for example when executing TSX. Furthermore, perverse instruction sequences may reveal that data is being shadowed. Admittedly, the templating already generates curious instruction sequences, such as PLP // SED // PHP // CLD. The PLP // ... // PHP conjugate ensures that the critical step occurs with the flags in place. Meanwhile, the PHP // CLD sequence ensures that simulated instructions may or may not be executed in decimal without affecting other functionality of the simulator. This functionality can be omitted if decimal mode is not required. Contrariwise, it is possible to add flags which are not supported in hardware.
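
To make the off-by-one cases concrete, here is a small Python model (not the simulator's code) of a stack whose top byte is a shadow copy of the flags. The class name, the method names and the initial flag value are assumptions for illustration. The point is only that the simulated program must see a stack pointer one higher than the native one, and that every simulated stack operation has to step around the shadow byte.

```python
class ShadowStack:
    """Model of page one with the simulated flags shadowed on top of the stack."""

    def __init__(self, flags=0x30):       # 0x30 is an arbitrary starting value
        self.page = bytearray(256)         # stands in for $0100-$01FF
        self.sp = 0xFF                     # native stack pointer
        self._push(flags)                  # the shadow byte is always on top

    def _push(self, value):
        self.page[self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF

    def _pull(self):
        self.sp = (self.sp + 1) & 0xFF
        return self.page[self.sp]

    def tsx(self):
        # The simulated program must not see the shadow byte: report S + 1.
        return (self.sp + 1) & 0xFF

    def txs(self, x):
        # Move the shadow byte so that it sits on top of the new stack.
        flags = self._pull()
        self.sp = x & 0xFF
        self._push(flags)

    def pha(self, a):
        flags = self._pull()               # lift the shadow byte out of the way
        self._push(a)
        self._push(flags)                  # and put it back on top

    def pla(self):
        flags = self._pull()
        a = self._pull()
        self._push(flags)
        return a
```

With this model, a fresh stack reports `tsx() == 0xFF` even though the native pointer is already `0xFE`. A program which inspects page one directly could still find the shadow byte, which is the kind of perverse sequence mentioned above.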

I initially considered NMOS 6502 simulation on CMOS 6502; maybe with a token extension, such as nybble swap, to provide an example of architecture extension. However, as the project progressed, I found that there was minimal hardship to implement all of the CMOS 6502 instructions while only using the original NMOS instructions. Indeed, if the templating system is configured such that ROR is implemented manually, there is minimal hardship (and minimal performance penalty) to implement the ROR bug workaround. The result is 65C02 simulation on 6501 hardware. If specifically required, it is possible to implement differing results of 6502/65C02/65CE02 decimal modes on systems with different or absent decimal support.
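
The workaround itself is not listed here. As a hedged illustration of why the cost is small, rotate right through carry can be composed from a logical shift right plus a conditional OR of the top bit, both of which are available on parts without a working ROR. The function name and the Python modelling are assumptions; this is a behavioural sketch, not the simulator's handler.

```python
def ror_without_ror(value, carry_in):
    """Rotate an 8 bit value right through carry using only shift and OR."""
    carry_out = value & 1          # bit 0 falls into carry, as with LSR
    result = (value & 0xFF) >> 1   # LSR
    if carry_in:
        result |= 0x80             # ORA #$80 when the old carry was set
    return result, carry_out
```

For example, `ror_without_ror(0x02, 1)` returns `(0x81, 0)`: the old carry enters bit 7 and the old bit 0 becomes the new carry.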

I have not stopped with commercial releases of 6502 variants. Specifically, I have investigated a hypothetical agglomeration of multiple draft extensions. The most significant contribution is Chuck Peddle's draft 6502 with Register B, which I tentatively name 6499 or 6502WRB [With Register B]. This assigns all xxxxxx11 illegal opcodes to ALU operations on another symmetric accumulator. Unfortunately, my choice of mnemonics may be less than desirable. Specifically, ORB, ANB, EOB, ADB, STB, LDB, SBB and CMB are defined symmetrically across eight addressing modes. This largely eliminates the requirement for an NMOS binary compatibility mode in the Atari/Synertek 6516 processor architecture. It also leaves the 11x1x100 opcodes available for an operand size prefix. (On 65816 and 65CE02, these opcodes push an immediate value on the stack, among other functions.) This triply hypothetical arrangement does not have full historical accuracy because the original instruction values are unknown. Regardless, it is easy to choose a judiciously rigged set of values in which they all co-exist.
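
The bit patterns above can be checked mechanically. This short sketch (the helper name `opcodes_matching` is an assumption) enumerates the opcodes which match a pattern where `x` means "don't care". It says nothing about which operation lands on which value, since those assignments are unknown.

```python
def opcodes_matching(pattern):
    """Return every 8 bit opcode matching a pattern such as '11x1x100'."""
    if len(pattern) != 8 or set(pattern) - set("01x"):
        raise ValueError("pattern must be eight characters of 0, 1 or x")
    # Fixed bits become 1 in the mask; don't-care bits become 0.
    mask = int("".join("0" if c == "x" else "1" for c in pattern), 2)
    want = int(pattern.replace("x", "0"), 2)
    return [op for op in range(256) if (op & mask) == want]
```

`opcodes_matching("xxxxxx11")` yields 64 values, which is exactly eight operations across eight addressing modes. `opcodes_matching("11x1x100")` yields the four values $D4, $DC, $F4 and $FC.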

I do not suggest that any of these extensions are good. However, they are quite easy to implement and investigate in more detail. Indeed, with a modest extension to a spreadsheet templating system, it is easy to implement arbitrarily wide and deep prefix trees. I have investigated one prefix in great detail. I do not believe that it would be particularly useful when implemented in hardware. Indeed, it may be detrimental to interrupt response performance. Regardless, it is suitable for virtual machine implementation to improve many functions (for example, instruction density of 16 bit pointer arithmetic) in a manner which is broadly compatible with existing macro assemblers.

I have seen reference in the 6502 forum archive to a suggested instruction prefix which swaps references to Register X and Register S. This provides stack indexing modes found in 65816 with the exception that indexing is tied to Register X rather than Register Y. Unfortunately, I did not understand the significance of this suggestion and therefore I apologize for the lack of attribution. Regardless, I have extended this concept such that references to P, A, X, Y are substituted with Q, B, S, Z where Q is taken from Atari/Synertek 6516, Z is taken from 65CE02 and B is taken from Chuck Peddle's draft 6502. The lack of TAB/TBA and the duplicate definitions of TSX/TXS are trivially reconciled. PHS/PLS provides the basis to investigate stack frames. Escaped versions of PHP/PLP/SEC/CLC/SED/CLD and suchlike may be used to implement mode bits or a privilege system.
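
One possible reading of the substitution can be modelled in a few lines of Python. This is an assumption, not the actual decoder: it treats the prefix as a one-way rename (P, A, X, Y become Q, B, S, Z) rather than a symmetric swap, and the names `SUBSTITUTE`, `effective` and `transfer` are invented for the sketch.

```python
# One-way substitution applied to the instruction which follows the prefix.
SUBSTITUTE = {"P": "Q", "A": "B", "X": "S", "Y": "Z"}


def effective(reg, prefixed):
    """Name of the register actually used by a (possibly prefixed) instruction."""
    return SUBSTITUTE.get(reg, reg) if prefixed else reg


def transfer(regs, src, dst, prefixed=False):
    """Model a Txy style transfer, e.g. TAX is transfer(regs, 'A', 'X')."""
    regs[effective(dst, prefixed)] = regs[effective(src, prefixed)]
```

Under this reading, a prefixed TAX copies B to S, while prefixed TXS and prefixed TSX both collapse to a transfer from S to S. That is one way duplicate definitions of TSX/TXS could arise, leaving encodings which might be reassigned, for instance to TAB/TBA.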

I may thoroughly investigate substitution of addressing modes. For example, I have written abs,Z mode to replace absolute. Although this functionality is present, it is dormant because it reduces performance while providing no additional functionality when Register Z equals zero. Unfortunately, if a virtual machine is never initialized and Register Z is preserved between context switches then abs,Z is not 100% downward compatible.
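
The compatibility argument comes down to one line of address arithmetic. In this sketch the function name is an assumption, and so is the wrap at 64KB, which the post does not specify.

```python
def effective_address_abs_z(operand, z):
    """Effective address for abs,Z: the 16 bit operand plus 8 bit Register Z."""
    return (operand + (z & 0xFF)) & 0xFFFF   # wrap at 64KB is assumed
```

With Z equal to zero, `effective_address_abs_z(0x1234, 0)` is `0x1234`, identical to plain absolute addressing. With a stale non-zero Z left over from another context, every absolute reference is displaced, which is the downward compatibility hazard described above.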

Unfortunately, a virtual machine with all of this functionality is likely to exceed 6KB. However, it provides a duplicate accumulator and index register with no further penalty. Indeed, if the clock cycle overhead of virtual machine context switching is not a concern, it is possible to faithfully transfer P, A, X, Y while preserving extended registers (such as Q, B, Z) between calls.

I hope to make a macro assembler which uses the same scripting language as the templating system, primarily because they are likely to run in the same environment. However, there are further benefits. If they read the same CSV file then mnemonics implemented in the virtual machine are immediately available in the assembler with no further configuration. I also hope to make the macro assembler compatible with dynamic binary decompression. This allows native binaries, zero or more virtual machines and/or common data formats to exceed 64KB and be dynamically decompressed into common or separate buffers. Decompression is intended to be compatible with atypical stack handling and therefore dynamic decompression should be compatible with any stack frame extension.
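
The assembler does not exist yet, so this is only a sketch of the shared-table idea under assumed column names: the same CSV rows which drive handler generation can be loaded as an encoding table. The helper names and the `OPERAND_BYTES` table are hypothetical.

```python
import csv
import io

# Same hypothetical CSV layout as a templating script might read.
OPCODE_CSV = """opcode,mnemonic,mode
A9,LDA,imm
A5,LDA,zp
AD,LDA,abs
"""

# Operand size per addressing mode (only the modes used in this sketch).
OPERAND_BYTES = {"imm": 1, "zp": 1, "abs": 2}


def load_encodings(csv_text):
    """Map (mnemonic, mode) to the opcode byte, straight from the CSV."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {(r["mnemonic"], r["mode"]): int(r["opcode"], 16) for r in rows}


def assemble(encodings, mnemonic, mode, operand):
    """Encode one instruction as opcode followed by a little-endian operand."""
    opcode = encodings[(mnemonic, mode)]
    return bytes([opcode]) + operand.to_bytes(OPERAND_BYTES[mode], "little")
```

For example, `assemble(load_encodings(OPCODE_CSV), "LDA", "abs", 0x1234)` gives the bytes AD 34 12. Adding a row for a new mnemonic to the CSV would make it available to both tools without touching either script.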

So far, 6502 on 6502 simulation has allowed me to evaluate abandoned architectural choices with moderate historical fidelity. I believe that single accumulator, asymmetric index registers was the best choice. I also believe that 65CE02 single accumulator, triple index register is preferable to dual accumulator, dual symmetric index register. However, cases are marginal and it took me more than two weeks to reach this conclusion. The symmetry of an 8 bit, dual accumulator, dual index register is alluring. It also has the obvious use case of 16 bit pointer arithmetic. However, in the general case, the ability to quickly source operands from more locations is preferable. This is particularly apparent when implementing the extended 65C02 instructions in a smaller subset. I have also found that multiple accumulator designs create their own pressure for architecture extension and this may partially explain the dominance of designs with semi-regular register files. (Separately, I've found that 2^R registers provide ALU/RAM latency matching across R generations of Moore's law.)

In the trivial case, if two 8 bit accumulators are able to perform 16 bit pointer arithmetic then the next bottleneck is the implementation of 16 bit index registers. However, if each index register is 16 bit then it creates pressure for each accumulator to also be 16 bit. At which point, it becomes trivial to implement 32 bit pointer arithmetic and the cycle repeats. The limit to this process is availability of opcodes to implement increasingly redundant legacy instruction sequences. (A good example is the Intel x86 AVX bit matrix transpose instruction.) Ignoring x86 insanity, this process is generally limited due to most processor architectures defining more than 2/3 of the instruction-space in the first iteration of the hardware. 6502 is a curious exception to the 2/3 rule, in part due to the removal of the Motorola 6800 style Register B, which occurred primarily for economic reasons but which is also aligned to formal methods and interrupt response.
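
The trivial case can be modelled as follows. This is a behavioural sketch in Python, assuming Register A holds the low byte and Register B the high byte of a pointer, with the carry passed from an ADC on A to an ADB on B. The function name and the register roles are assumptions.

```python
def add16_with_two_accumulators(a, b, lo, hi):
    """Add the 16 bit constant hi:lo to a pointer held as B:A (high:low)."""
    total = a + lo               # CLC then ADC #lo on Register A
    a, carry = total & 0xFF, total >> 8
    total = b + hi + carry       # ADB #hi on Register B, consuming the carry
    b = total & 0xFF             # the final carry out is discarded here
    return a, b
```

`add16_with_two_accumulators(0xFF, 0x12, 0x01, 0x00)` returns `(0x00, 0x13)`, so the pointer $12FF becomes $1300 without spilling either half to memory. The pressure described above starts when the result then has to be used as an index.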

This self-generated architectural extension pressure doesn't apply to mono accumulator designs. This doesn't make them bad. It only makes them out-moded. Indeed, I believe this is an advantage because it allows a more considered approach to architectural extension, if any. In particular, I believe that it is beneficial to recognize the distinction between single accumulator and multiple accumulator designs and harness it in a controlled manner inside a virtual machine. Even if this is an architectural dead-end, even if it exists as a software only implementation with considerable performance penalty, restoration of Register B is useful for a subset of cases, such as utility programs. In general, baseline functionality may match or exceed 65C02 while providing a familiar register model, binary compatibility with familiar macro assemblers and high level language compatibility (BASIC, Forth, C). It provides instruction density, pointer manipulation and logical operations which are comparable or superior to other virtual machines.

I have yet to mention the most important facet of this technique. We are able to share concrete, tested architectural extensions which work across ASIC, CPLD, FPGA, AVR/Arduino, ARM/Teensy (such as MCL65+), beebjit, MAME, VICE, Py65 and other environments.


Attachments:
reg-transfer0-0-2.gif [ 18.24 KiB ]
reg-transfer0-0-1.odg [ 6.3 KiB ]

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!