6502.org • View topic - Multi-Core 64 Bit SIMD On AVR

View unanswered posts | View active topics

Board index » 6502.org Users Forum » Emulation and Simulation

All times are UTC

Multi-Core 64 Bit SIMD On AVR

Page 1 of 1

[ 11 posts ]

Previous topic | Next topic

Author

Message

Sheep64

Post subject: Multi-Core 64 Bit SIMD On AVR

Posted: Tue Aug 11, 2020 3:53 am

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field

I'm working on a 64 bit SIMD extension to 6502. Initially, I was working on a 64 bit RISC SIMD processor architecture without 6502 compatibility. However, various iterations of the design had poor instruction density, poor execution speed, a 350KB virtual machine or unwieldy hardware cost. Furthermore, none of these designs had a compiler, operating system or applications. Reversing my design into 6502 solved a large number of problems while only introducing minor complications; mostly around stack and flags.

I started with the 65CE02 instruction set. 65CE02 can be concisely described as a superset of NMOS 6502 with bit operations and an additional index register called Z. I thought that 65CE02 was a good basis for a RISC extension because A, X, Y and Z logically map to 2^R registers commonly found in a RISC register file. I am willing to be corrected if this is erroneous.

I discarded the instructions which make 65CE02 useful as a microcontroller in, for example, a keyboard. I replaced them with more useful instructions for 64 bit computing. The base 65CE02 instruction set is otherwise unchanged, although it is moderately escaped. Specifically, 65CE02 bit flip instructions xxxx0111 (0x?7) are now SIMD prefix instructions and 65CE02 bit test instructions xxxx1111 (0x?F) now indicate two byte register-to-register RISC instructions. Potentially, the RISC instructions could be an exact replica of ARM Thumb instructions, with the exception that instructions would not be necessarily word aligned - and the first two nybbles would be swapped. At a tentative stage of development, it is significantly easier to implement a four octal digit encoding which allows operations of the form R5=R6*R7. To reduce encodings, commutative operations, such as addition, multiplication, AND and OR may share opcodes such that an operand may be doubled or squared at the expense of uninteresting logical operations.

Opcodes 0x07, 0x17, 0x27 and 0x37 provide prefixes for a smattering of 8, 16, 32 and 64 bit SIMD operations on A, X, Y and Z. These SIMD prefixes may be combined with RISC instructions. Opcodes 0x87, 0x97, 0xA7 and 0xB7 provide prefixes for the same smattering of 8, 16, 32 and 64 bit SIMD operations on an alternate set of registers which I tentatively name C, I, J and K. These prefixes also permit a second set of RISC SIMD opcodes. For example, operations with saturation arithmetic. (As stipulated by Edson de Castro as part of the Data General Eclipse 32 bit extension, all of this can be achieved without mode bits. For more detail, see The Soul of a New Machine by Tracy Kidder. Perhaps William D. Mensch, Jr.'s 16 bit extension would be less hated and the fabled 32 bit Terbium extension would be more tractable if similar practice was observed.)

The astute will notice that I have only defined half of the prefix instructions. The remainder are reserved for future expansion. The most obvious extension is to double the number of registers. However, this obtains diminishing returns, especially when zero page is available. It is possible to define 128 bit, 256 bit, 512 bit and 1024 bit operations. This is not completely outlandish. Intel and Xtensa already have 512 bit hardware, a GPU with 1024 bit bus is available and ARMv8 specifies 2048 bit SIMD. Alternatively, unused prefixes can be used to provide more opcodes. Most obviously, 32 bit and 64 bit floating point operations within the main registers. Other candidates for floating point precision include 16 bit, 40 bit, 48 bit, 80 bit, 96 bit and 128 bit.

Processor architecture address-space varies considerably and this is complicated by architecture extension, vendor page-banking schemes and considerations for privilege protection and multithreading. A processor architecture typically defines a subset of:-

Program segment with or without shared libraries.
Working segment, such as 6502's zero page.
Data segment with or without shared memory and/or memory mapped files.
Constants segment.
One or more stack segments.
I/O segment.
Internal bus for the purpose of poking system innards.

Any attempt to extend 6502 is going to be relatively unusual due to handling of zero page. Previous efforts have allowed the stack to be moved and/or increased and allowed zero page to be moved. Previous efforts have also split and placed program and data segments within a larger address-space. 65Org02 is notable for implementing a 64 kiloword zero page. However, as word size increases, extensions have generally failed to maintain 128 or 256 memory references within zero page. This is a particular problem for a 64 bit addressing because it is only possible to hold 32 pointers in 256 bytes. Downward compatibility should be maintained. However, extended modes may require a 512B, 1KB or 2KB "page" which does not use prefix instructions.

From my experiments, I have found that it is possible split and grow program, data and stack segments using unrelated techniques. Stack segment may exceed 16MB with no ill effect. Data segment requires architecture extension. Program segment growth can be mostly solved by using longer branches and jumps. The awkward case is a data segment reference to program segment. A fairly general case is C++ vtable. As noted by Kerringham and Pike, a related problem occurs with device drivers. In this case, it is possible to stretch a 16 bit pointer by 3 bits or more and implement a prescaled indirect jump to an 8 byte, 16 byte, 64 byte or 256 byte boundary. This alignment is only required on targets referenced from data segment. Regardless, with very little modification, it is possible to run a 6502 binary which may 512KB - or considerably larger - without the use of overlays or page-banking techniques.

I'm trying to keep pointer bit shifts to a minimum because I intend to implement this on an AVR microcontroller. I believe that I am using the same model as Klaus Dormann. Klaus Dormann is quite a luminary among 6502 enthusiasts; moreso than most of the 6502's original design team. Klaus Dormann is best known for an extensive 6502 test suite and a 6502 emulator written in AVR assembly. The emulator is available on standard terms for non-commercial use. On a 20MHz ATMEGA328P, emulation runs at approximately 1MHz. There is a minor hobby of overclocking ATMEGA devices to 24MHz. This would provide approximately 1.2MHz emulation, although this is strictly outside of specification. Unfortunately, this is an optimistic case. An ATMEGA328P is the main component of an Arduino Nano and its $5 clones. These boards are typically underclocked to 16MHz to allow use with an unregulated lithium battery. (In this configuration, it is strongly advisable to avoid execution of the HCF opcode.)

After several months of consideration, I remain somewhat awed by the mind-bending insanity of implementing one processor architecture directly on another. For much of the process, it requires holding two disperate register sets in mind while implementing an almost bit-perfect mapping between them. AVR is almost a strict superset of PIC and Z80. Therefore, the task is fairly similar to writing a 6502 emulator in Z80 assembly. Furthermore, Klaus Dormann achieved this with almost clock-cycle accuracy. That's impressive.

I'm attempting the lazier task of writing an implementation in C and letting a compiler do the majority of the work. Unfortunately, the output of avr-gcc is a constant stream of WTF. Compilers for other microcontrollers are worse. Gotchas include the inability to allocate 32 bit or 64 bit globals correctly. Thankfully, a processor register context can be allocated within main() - assuming that the compiler didn't optimize away main(). Hmm. Perhaps sidestepping avr-gcc is the most sensible option.

If you don't care about SIMD on legacy instructions, don't care about commercial use and you're willing to write AVR assembly, more than 90% of an ATMEGA328P's EEPROM is available to extend the 6502 processor architecture without affecting the performance of core instructions. If you don't care about performance, do care about your sanity and you are favorable to the Lesser GNU Public License, it is possible to implement some or all of the architecture using C or C++ while using any desirable permutation of Arduino development environment, library, bootloader and run-time. If you find the Arduino IDE too patronizing, a Makefile is widely available to compile a binary, strip a binary, deploy a binary and verify a binary. This can be amalgamated with a Makefile for testing and archiving. From here, you're only a few steps away from abandoning Arduino libraries and Arduino boards. You'll be using a USB-to-SPI programmer on an Arduino clone board which is supplied blank without the Arduino bootloader. Then you'll be soldering an oscillator to a DIP-scale AVR and improvising a 6 pin ICSP connector from scrap wire. You may develop a one AVR per week habit. Maybe more. But you're in control. You're not an addict. You can quit at any time. Eventually, you'll join a support group, such as forum.6502.org. ("Hello. My name is ________ and I am a 6502 on AVR emulation addict.")

Anyhow, Arduino is a significantly more inclusive proposition and especially so given the wide distribution of Arduino hardware and support. A diligent and not particularly smart 14 year old could probably add an opcode to a working implementation. Speaking as a former diligent and not particularly smart 14 year old, it would have been an *amazing* opportunity to have a fully extensible 6502, written in C, running at a reasonable speed.

#include <rant/kids-today>
#include <rant/both-ways/snow>
#include <rant/both-ways/uphill>
#include <rant/lawn/get_off>

Getting back to 16 bit data segments, it is possible to connect a $2 microcontroller to a bank of memory and allocate 64KB data segments on 64KB boundaries. This requires no bound check and no exception processing. This saves considerable overhead when implementing a 64 bit processor architecture on an 8 bit microcontroller while allowing 6502 programs to each run in a separate memory space. A read-only program segment may be unrestricted size. Bound check is only required for branch or jump which crosses page boundary. Likewise, stack segment only requires bound check when crossing page boundary. Meanwhile, data segment and zero page never require bound check. These memory interface routines may be selectively compiled with an interpreter for 6502, 65C02, 65CE02, 65816 and the extensions of your choice. In my case, that's a 16 bit, 32 bit and/or 64 bit RISC extension. Unrolled loops for 64 bit SIMD barely fit into the 32KB EEPROM of an ATMEGA328P. If a 32 bit extension is sufficient, there is probably enough space to include a 6809 or Z80 interpreter. Or floating point instructions. Or perhaps some cryptography instructions.

While I'm vaguely competent at writing C and soldering oscillators, I doubt that I'd ever get an AVR running in a genuine 6502 or 6510 socket. The first hurdle is a lack of suitable test equipment. Regardless, instruction set extensions overlap with a project of more general interest. Specifically, for applications which are not performance sensitive, an AVR to 6502 adapter board would be cheaper than a 65816 to 6502 adapter board. It also has the intriguing possibility that extra functionality can be added. This includes wider registers, hardware multiply and native support for SWEET16.

Unfortunately, it also comes with a major disadvantage. Fully synchronous bus operation requires a three clock cycle busy loop. In a legacy system with a 2MHz synchronous bus and 6845 video adapter (or similar), a binary which doesn't look too closely at flags held on stack will run at a maximum of 40% efficiency. This is likely to break any tape or disk storage which uses timing loops. It also borks some games. However, it is sufficient for most productivity software, music software, almost anything written in BASIC and much more besides. Some efficiency gains can be made by caching zero page and/or stack. This reduces the number of bus synchronization cycles. However, eliminating writes to zero page breaks one of Commodore's incompatible page banking schemes. Likewise, eliminating writes to stack may break stack introspection via the X register. Fortunately, this is a rare case. In general, the maximum aggressiveness of caching is determined by host and application. At the expense of minor compatibility, it is possible to exceed Klaus Dormann's current results when a 20MHz AVR microcontroller is socketed in a legacy host.

In particular, significant gains can be achieved by placing a breakpoint on a legacy host's integer and/or floating point multiply routines. The breakpoint address only requires testing during JSR (and RTS) and may be implemented such that no additional legacy bus cycles are required. A positive match invokes hardware acceleration. This speeds multiplication and trigonometric functions by at least a factor of 100. Overall performance will be far less pronounced but it is likely that numerically intensive programs may run faster - despite the multiple impediments of emulation and bus synchronization. As suspected in the DTACK Grounded newsletter, issue 2, Microsoft's 6502, 40 bit floating point multiply routine does not vary across host. As noted in issue 2 and 3, patching one or more float routines is relatively trivial. Unfortunately, other formats, such as Acorn's 48 bit floating point format, may also be of interest. Fortunately, common cases can be handled with one firmware implementation. Firmware in the AVR microcontroller may seek an optional I2C boot ROM. If found then 6502 code may be copied to RAM and executed. From here, it is possible to identify the legacy host (Acorn, Apple, Atari, Commodore, other), set suitable caching and breakpoint parameters and then jump to the 6502's usual reset vector. This arrangement avoids placing an exception case inside the fetch-execute cycle which would otherwise be required to make the serial boot ROM memory mapped and executable.

It is possible to extend this system much further. In particular, it is possible to bootstrap away from a legacy host. This is easiest with a card bus system with an address space which is larger than 64KB. Starting with a suitably large amount of RAM (less than 16MB), AVR microcontroller firmware, a hacked up version of CC65 and productivity tools on the legacy host (text editor, BASIC), it is possible to implement a 6502 assembler which understands 32 bit opcode prefixes and uses 32 bit memory. From here, it is possible to self-host a C compiler and linker which runs from the legacy host's prompt. From here, it is possible to statically link a Contiki environment suitable for development. This includes web browser, file browser, text editor, desktop utilities, shell, scripting language, build system and compiler. So far, we have all of the flakiness of shared memory, co-operative multi-tasking and it is likely to be in 40*25 PETSCII (or similar). This is about midway between a Commodore 64 and Amiga Workbench 1.3. However, it is self-hosting and more than sufficient to bootstrap to IDE, VGA, 100Mb/s Ethernet and maybe a NetBSD kernel or suchlike. A multi-core system with crossbar memory is also possible. This can be shrunk to credit card size, scaled to a rack or further developed to contemporary standards. In particular, all variants may be implemented with the same version of microcontroller firmware and the same version of boot ROM.

Technically, this can be built out from a stock Arduino Uno or Arduino Nano with nothing more fiddly than 0.1 inch strip board, DIP sockets and edge connectors. However, the 16MHz or 20MHz section would be significantly more reliable if it was smaller, closer together and used shielded, multi-layer PCB. And it would be more likely to do anything if it was designed by a hardware expert. My contribution to hardware design is a state machine which may be entirely discarded. It has an initial state which allows interrupt polling and optional crossbar arbitration. Using three control lines, it is possible to ripple through six states to set a 16 bit address and write one byte. An overlapping six states allow an arbitrary one byte read. By backtracking through the states, it is possible to implement sequential read, sequential write and atomic read-modify-write. This can be implemented with one 74x138 demultiplexer chip and multiple 74x374 tri-state latches which are set when exiting a state. This is for the optional legacy host bus. One additional control line, one additional 74x138 demultiplexer and three or more latches provide a larger address space which does not require synchronization with the legacy bus. Using prefix opcodes, it is possible to access this bus as a linear address space.

It is possible to implement 6502 hardware emulation on a microcontroller which does not require I/O multiplexing. However, the cost and performance improvement is marginal. It is also significantly less applicable to casual users, such as the typical Arduino experimenter. An ATMEGA328P has an 8 pin I/O port, a 7 pin I/O port and a 6 pin I/O port. It is commonplace for pins to operate in one of five separate modes. In particular, 2 pins connect to a crystal oscillator, 4 pins are required for reprogramming and 2 pins may be used with I2C devices, such as boot ROM, thermometer and clock. For my purposes, 3 or 4 pins can be used as multiplexed control lines, 1 or 2 pins are unused while the 8 pin I/O port multiplexes bytes of address and data. This is routed to the legacy host, on board RAM or a card bus, as required. Hardware profile (socket, boot ROM, thermometer, clock, RAM, cores, crossbars, card slots) may differ significantly but all implementations use the same firmware.

I'll finish with a few questions.

Are there any architectural features which should be added, changed or removed? Klaus Dormann's 6502 emulation will run with minor changes to the memory interface. From that, you can tinker endlessly with extra instructions and registers. However, is there any essential hardware which is missing, such as UART, buttons or lights?

Could someone design a board? I can provide a much more detailed specification and I'm willing to pay for design and finished units.

Does anyone require 64 bit long long int support in CC65 or TinyC? For example, for Year 2038 compatibility?

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!

Top

BigEd

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Tue Aug 11, 2020 8:24 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England

Welcome! Some interesting ideas in there. How much have you presently got working?

Top

GARTHWILSON

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Tue Aug 11, 2020 8:40 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8510
Location: Southern California

Welcome! And I'm not sure if you're saying you're still 14 years old now; but if you are, that's some excellent writing for a 14YO!

Quote:

(As stipulated by Edson de Castro as part of the Data General Eclipse 32 bit extension, all of this can be achieved without mode bits. For more detail, see The Soul of a New Machine by Tracy Kidder. Perhaps William D. Mensch, Jr.'s 16 bit extension would be less hated and the fabled 32 bit Terbium extension would be more tractable if similar practice was observed.)

Hated the '816? No; those of us who've put some time into it like it. Those who have not are scared off by what they don't understand. For machines that don't have to run legacy 8-bit 6502 code, you clear the emulation-mode bit upon startup, and never touch it again. In my '816 Forth, I keep the accumulator in 16-bit mode almost full time, and the index registers in 8-bit mode almost full time, so those mode bits hardly ever get touched either. Op-code prefixes are not needed. Where 16-bit quantities are constantly being used, the '816 is much, much more efficient than the '02. We would however like the eight high address bits brought out, rather than having them multiplexed on the data bus. The multiplexing is not complicated, but it does add parts and certain design considerations.

Quote:

I/O segment.

Note that there are advantages to having all the addressing modes available for I/O as well, the same as for regular memory. BigDumbDinosaur here has even used the (DP,X) addressing mode to reference I/O to great advantage.

Quote:

65Org02 is notable for implementing a 64 kiloword zero page.

The 65Org32 is my baby. Address bus, data bus, and all registers are 32-bit, including the base-page register, and all operand words are 32-bit, including those referring to the base page. The base page can still access all 4 gigawords of memory; so the register's contents are really just an offset, as is the case also with the stack pointer and the data-bank register and the program-bank registers. There are no bank or page boundaries. The entire 4-gigaword address space is one bank, one page, and it wraps; so it's like the whole thing is in the zero page, and there's no need for long addressing modes, and the only need for absolute addressing modes versus ZP (actually DP) is to choose whether you want the DP offset or not. The only thing that's less than 32 bits might be the status register which probably doesn't have any need for that many bits.

So far a 65Org32 has not been made. I'm not into programmable logic, but on my list of things I want to do (but far from the top) is to do it with a PIC microcontroller (and I'll need some outboard ICs to interface to all the address- and data-bus bits). The performance will be terrible, but it would let me experiment with the instruction set. (I'm definitely not sticking up for PICs over AVR; but PIC16 is what I have a ton of experience with, and '18 and '24 are just extensions of the '16.)

Be sure to see Ed's Index of threads for improved 6502 and derived architectures. I've been saying that the 32-bit one most likely to get done now is the 65m32; but now that has been collecting dust for several years too. It's a a 65-oriented 32-bit processor concept from Michael Barry. Operands up to 16 bits wide are merged with op codes so the entire instruction can be fetched in a single clock cycle. Wider operands of course need another word in order to get the entire 32 bits.

Regarding the questions:
I've hardly wanted any extra registers; but one might be another one to use just like X. The 65816 diminishes the need though, because it has the stack-relative addressing modes so you don't have to use for example 105,X like you do on the '02. The other extra register I might like is a double-width one for multiply results (like a 64-bit for a 32x32-bit multiply) and for the dividend of a divide (like 64-bit divided by 32-bit, resulting in a 32-bit quotient and 32-bit remainder). As for UART and so on, those are parts of a microcontroller that are on the same IC but are not part of the processor.

Many here (including myself) can design boards. It's just a matter of having the time. But if you're interested in this kind of thing, you'll be wanting to design another and another and another, so you might as well pick a good CAD package and get familiar with it, with plans to use it long-term. Wikipedia has a brief comparison of a long list of CADs, many of them free, at https://en.wikipedia.org/wiki/Compariso ... A_software. See also http://pcbshopper.com/cad/ for some info and reviews on free PCB CAD.

Some prototyping board houses have had their own CAD (Express PCB comes to mind, not to be confused with PCB Express, now part of Sunstone) which you can download free but which does not produce industry-standard files, meaning that if you decide you don't like the board house's service or price, you can't just send your files to another board house. (That's probably on purpose, huh?) I would recommend using only CAD that produces industry-standard gerber 274X (not the older 274D) gerber files (gerber is for the photoplotting, ie for the copper layers, soldermask, legend, and solderpaste—the "graphics" part of the board) and excellon drill files. Some board houses will accept some popular CADs' files directly, without your having to gerberize first, for example Eagle's *.brd files; but if you want to make sure you're not limited, make sure the CAD can produce gerber 274X files if needed. RS-274X gerber photoplotter files are just text, and you can edit them with any text editor if you know the standard; so it becomes easy to do certain things your free or cheap CAD can't do directly.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

Top

ttlworks

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Tue Aug 11, 2020 9:45 am

Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1429

Hi Sheep64, welcome to the forum.

Sheep64 wrote:

I doubt that I'd ever get an AVR running in a genuine 6502 or 6510 socket.

There might be worse things.
I'm not into Atmel microcontrollers, but it's true that ATMEGA328P doesn't have too many I\O pins.

Sheep64 wrote:

However, is there any essential hardware which is missing, such as UART, buttons or lights?

UART really might be helpful for debugging, one option would be to just "map" the on_chip peripherals of the microcontroller somewhere into the (emulated) 6502 address space,
but this might take some of the I\O pins...

Sheep64 wrote:

I can provide a much more detailed specification and I'm willing to pay for design and finished units.

I'm afraid that it might be going to be a "homework assignment" for you, but we sure would take a look at your schematics and PCB layouts and try to give some help\advice.
In the past I always had advocated Eagle when it came to making PCB layouts, but this had stopped because I'm not happy with the AutoDesk licensing "rental scheme".
Try KiCad, it's possible to work with it, and a lot of the PCB manufacturers accept the KiCad files without that you have to worry about Gerber conversion and such.

;---

Hmm... not sure if you are interested in tinkering with FPGAs, but there is the XLR8 Arduino compatible board containing an Intel\Altera 10M08 FPGA.
IIRC the VHDL only compiles with Quartus 17.1 (and not with other versions of Quartus). But a 10M08 FPGA ain't big.

The Hinj Arduino compatible board has more I\O pins and a 10M50 FPGA (biggest FPGA in the Intel\Altera MAX10 family).
But to me it looks like no external RAM is attached to the 10M50, and maybe you don't need that WiFi and Ethernet functionality.

The Terasic DE10 board (more documentation here) features a lot of I\O (plus LEDs and switches), Arduino connectors, an Altera 10M50 FPGA and 64MB of 16 Bit SDRAM.
But getting a VHDL implementation of an ATMEGA328P up and running on the DE10 and making it using that SDRAM probably would be "opening another can of worms"...

Top

Sheep64

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Mon Sep 07, 2020 7:24 pm

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field

I am horrified that I am repeating a pattern which has ongoing for at least a decade. Specifically, newbie creates account and announces an architecture extension which is never completed or commercialized. This trend was old in 2007. But do we know what happens to these efforts? Is it compiler support? Operating system support? Poor gains? Complexity of the task? Market acceptance? I was not aware of all variants given in references but there may be some omissions. Specifically, the MyCPU Project, Gigatron and 65j02 may be of interest.

Other Projects

The MyCPU Project is gorgeous. I wish that I invented it. It is an extensible rack design implemented with logic chips on the smallest 160mm*100mm EuroCards. It is loosely based upon 6502, although not binary compatible. I suspect that flag handling is incompatible. Regardless, there is a fork of TinyC and, therefore, for practical purposes, compiler support is marginally behind 6502. The CPU requires home PCB etching of five cards. Memory, serial, VGA and Ethernet require further cards. A full system is sufficient for GUI web browsing via VNC client. Due to the arrangement of the small cards, there is one 8 bit data path through the CPU. This may explain why some instructions require up to 10 clock cycles rather than a 6502's maximum of seven clock cycles.

The Gigatron is even more loosely inspired by 6502. Unfortunately, the Gigtron is a veritable set of architectural dead-ends including a Harvard architecture with an 8 bit data segment and read-only 16 bit program segment (of which 8 bit is for opcode and 8 bit is for operand); a branch delay slot; and a completely borked I/O implementation, which is lampshaded with an extension unit called Pluggy McPlugface. The least 6502 feature is the X:Y+ addressing mode which indexes an upper:lower memory address with post increment only on Y. I strongly suspect this addressing mode is used to implement VGA output. The Gigatron omits 6502 stack and interrupts. (The method to implement rastering without interrupts around an interpreter with string handling is left as an exercise.) Most impressively, the single board can be soldered from parts in 2.5 hours, it runs at more than 6 MIPS and it is supplied with a driving game which is similar to Namco's Pole Position.

I'm not sure about the details of the 65j02. I am under the impression that Jeri Ellsworth shipped an FPGA implementation which contained a (presumably) clock cycle accurate Commodore 64 emulation. I'm particularly irked about this project because it occurred to me many years ago that a legacy games console could be process shrunk until it could be contained within a joystick and, therefore, the lead from the "joystick" would be something akin to a UHF modulated NTSC signal. It was particularly galling to learn that 70,000 were sold on one day on QVC, 250,000 were manufactured in one production run and more than 500,000 were manufactured in total. I suspect many people would like to sell a similar quantity of 6502s, especially if it is possible to make a dollar or more on each one. Details are sketchy but I've been told by a friend that it is possible to expand one of these "joysticks" into a full Commodore 64 with cassette deck, multiple 1541 floppy drives, IEEE-448 printer and networking. From here, a Raspberry Pi can be used to persist 1541 protocol or bridge to a read-only FTP archive. Potentially, WebDAV allows read/write random access to a remote filesystem with encryption and authentication. This is less of a bodge than NFS, SMB or FAT32 block storage over USB, Ethernet or similar.

Most significantly, I've been told that the C64 Direct-to-TV has dormant 32 bit functionality. This is entirely plausible given that Jeri Ellsworth's previous work includes a multi-system emulator. Can anyone confirm this? Or is it the seed of a rumor of a clock cycle accurate 6502 with 32 bit extensions? Also, don't stop when you're winning! Why are there no sequels to this product? Jeri Ellsworth appears to be working on an augmented reality project. Perhaps the low energy requirement of wearable computing suggests a hacked-up, 32 bit, inspired-by-6502 design with FPU extensions?

My project appears to be almost a strict subset of André Fachat's 65002. I have also seen reference to 65GZ032 but it appears to have been run from a Yahoo mailing list which fell dormant before it was deleted. I was previously unaware that 65GZ032 was intended to run from a legacy host, like a Commodore 64. I am simultaneously encouraged and discouraged that I have vaguely conventional ideas but that my predecessors have not been wildly successful. I was previously unaware of 65Org32. I was only aware of 65Org16. Indeed, I laughed outrageously when I discovered 65Org16. You guys know how to have good, clean fun! The most common cause of processor architecture obsolescence is exhaustion of the address-space. So just make everything twice as wide! Who cares if that kills binary compatibility - or the ability to read bytes? But hey, you move from an architecture which defined less 2/3 of the instruction-space to an architecture which defines about 0.23% of the instruction-space. One fork filled this with two nybble fields allowing 16 position shifts and 16 sets of registers. For example, TXA now has 16 sources and 16 destinations. I'm disappointed that no-one suggested 16 stacks. Regardless, the 16 bit 70MHz FPGA implementation, suitable for VGA display, convinced me that a softcore 6502 extension is viable as a GPU.

Requirements

I'm still reading through references. The references are invaluable for gathering requirements. A subset of contributors would like to see a 6502 (or a extension thereof) decode a contemporary video format at a contemporary resolution. Technically, that requirement is a moving target which gets more difficult every year or so. Nevertheless, it is a sentiment which I share. I also believe that it is possible to implement a Java interpreter which outperforms x86, ARM and RISC-V.

I have noted two pairs of contradictory requirements and I may encounter more. The first pair of contradictory requirements is fast interrupts and atomic block copy. I suspect this is part of a larger problem where the usual two choices are synchronous block copy with all of the fun of cache churn or asynchronous DMA with all the fun of queuing and notification. For anyone who disdains a cache, churn isn't a consideration. The second pair of contradictory requirements is parsimonious, composable opcodes and hardware integer divide. (Float is a separate case.) One of the features which made ARMv1 notable among a long line of forgettable, interchangeable RISC processor architectures was the decision to implement addition, subtraction and multiplication but not division. The rationale was quite simple. Opcodes to perform addition, subtraction and multiplication can be used to perform arithmetic with arbitrary precision. However, this is not the case for hardware divide. Furthermore, division is the least common of the four operations and division by powers of two can be reduced to logical operations. In the miscellaneous case, where the least common operator has a denominator which is not a power of two, a subroutine (typically called _moddiv or similar) performs long division. The integer result and/or the remainder can be used after a single call. So, a=x/7; b=x%7 amortizes the work of one long division.

There are many suggestions for additional opcodes but relatively few suggestions for memory models, privileges or removing features. For example, I'm considering replacement of the absolute addressing mode with absolute+Z. The reason for this becomes apparent where the memory model is 32 bit (or larger). In this case absolute+Z should be regarded as Z+absolute where a 32 bit pointer (or larger) is supplemented with a 16 bit unsigned struct offset. Implementation divides into three cases. When emulating a legacy 6502 dialect with register Z, absolute addressing mode without Z offset must be honored. When emulating my desired dialect, temporarily clearing X, Y or Z allows X+abs, Y+abs or Z+abs to work as abs. (Likewise for alternate registers I, J and K.) When emulating a legacy 6502 dialect without register Z, hardware reset and/or executive code should clear Z. The previous two cases are now interchangeable and the most convenient one can be implemented.

Interupt Latency Versus Atomic Operations Versus Virtual Memory And Multi-Core

I don't plan any changes to the core 151 instructions of NMOS 6502 beyond Z+abs addressing. Other changes, such as longer jumps and branches, two sets of registers and SIMD, are only invoked with prefixes. However, consideration of opcodes in extensions, such as INW and DEW (16 bit atomic increment and decrement), led to a productive chain of thought. In my desired dialect, there are always 256 short memory references which are accumulator width. In 64 bit mode, this leads to a zero "page"/base "page"/whatever which is 2KB rather than the customary 256 bytes. In part, this provides compatibility with JVM opcodes. For example, 0x15 ILOAD, 0x16 LLOAD, 0x17 FLOAD, 0x18 DLOAD and 0x19 ALOAD would all map to LDA with base page addressing mode. (Java syntax looks like C++ without pointers but don't be fooled. A JVM distinctly looks like Forth targeting 6502 and 8086.)

In my desired 6502 dialect, which is derived from 65CE02, opcodes such as INW with base page addressing, are redundant. However, I was alarmed that I would be discarding atomic operations which are not present in the core 6502 instruction set. Indeed, there are no atomic operations in the core instruction set. Instead, there is a standard idiom, common across many 8 bit processor architectures:-

Clear maskable interrupts.
Do anything.
Set maskable interrupts.

Ah, yes. For any code running outside of NMI, opcodes CLI and SEI are bookends for an arbitrary, Turing complete, atomic operation. That works for single-core, single-tasking, single-threaded, single-user code and can be stretched to a few cases outside of that. In the general case, a contemporary system requires a semaphore or mutex which conveys more than one bit of state without incurring the ABA problem (like x86). POSIX compatibility may also be desirable, although Linux pthreads demonstrate that implementation doesn't have to be particularly conformant.

My first idea to provide upward and downward compatibility is probably a dud. If it is possible to read or write a pointer with one instruction then it is more than sufficient to have a maximum of 16 instructions between CLI and SEI. Unfortunately, an alpha geek said this limit was idiotic. Indeed, it interacts badly with virtual memory and would probably require a four bit counter within a privileged flag register. Hey! I created four modal bits at once!

If we truly don't care about interrupt response time then it is possible to take an idea from the Transputer processor architecture and eliminate the modal interrupt bit. The Transputer architecture has a few good ideas although most of them are taken from the Burroughs B6000 architecture. A fundamental concept in a Transputer is that a straight run of code is implicitly atomic. This dovetails with use of an internal stack which may only be three elements deep. For a bottom priority task, preservation of the internal stack is not guaranteed after a control flow instruction. This is when hardware task switching occurs. Therefore, correct operation requires the internal stack to be exhausted or contain junk at the end of each instruction sequence.

There are multiple advantages if we apply this idea to 6502. Firstly, for 6502 emulation, there is a performance advantage because interrupt polling only occurs during control flow execution. Secondly, code density is improved because all instances of CLI and SEI can be discarded. Thirdly, we can discard the corresponding modal bit - or at least when using dialects of 6502 which don't have CLI or SEI. (Don't worry! We've still got the decimal modal bit!)

Fourthly, for 6502 emulation, it aids optimization of stack. [This explanation is quite long and can be skipped.] Take the example of a straight run of code to apply (a+b)&c to 32 bit values. In this example, a+b has to be evaluated in ascending byte order but the logical operation can be applied in either order. To minimize program size, memory cycles and therefore clock cycles, intermediate values would typically be held on stack. Therefore, the least significant three bytes of a+b would typically be pushed on stack as each byte is evaluated. When progressing from the arithmetic operation to the logical operation, the top byte remains in the accumulator. The least significant three bytes are subsequently popped as the logical operation is applied in descending byte order. If no interrupts (and no stack introspection) can be guaranteed during this sequence then intermediate values can be held in a shallow internal stack. Eight bytes would handle typical cases. This eliminates the memory cycles for the push operations and the corresponding pop operations. This approach still works if we have interrupts and stack introspection. However, if we receive an interrupt signal and then make the external stack consistent, interrupt processing is delayed by a variable number of cycles. It is possible to cache to stack pointer boundaries. However, it is more efficient to make the full depth of the internal stack available at the beginning of every interrupt level, at the beginning of every subroutine and at the beginning of every condition and loop. In common with a Transputer, a deferred interrupt dovetails with exhausting an internal stack after a straight run of instructions.

Fifthly, we may not have arbitrary, Turing complete, atomic operations but we retain useful operations such as linked list insertion and fixed length block copy (as an unrolled loop). Sixthly, if we count the failure to follow a conditional branch as a straight run of code, we get atomic operations such as test-and-set, compare-and-swap and variable length block copy (as an unrolled loop). Ignoring the significantly worsened interrupt response time, many of the disadvantages are currently hypothetical. For example, implicit atomic operations interact badly with NUMA, crossbar memory and virtual memory.

Stack Versus Register

I am a fan of stack designs and single accumulator designs. However, it is undeniable that such designs get mired when they are process shrunk and/or memory grows. Why does this occur? In my experiments with processor design, I've found that there is a critical path running from accumulator to ALU and back to accumulator which is rarely less than 14 logic gates. 6502 meets this 14 gate criteria. 65816 does not meet this 14 gate criteria. Nor does Z80 or x86. RISC designs typically fail to meet the 14 logic gate criteria although this is not due to register width. One of the most obvious failings of a symmetric register file is that 2^R registers adds R logic gates to the critical path. In mitigation, the same principle applies to addressable memory. 1KB has 10 gates in the critical path. 1MB has 20 gates in the critical path. 1GB has 30 gates in the critical path. 1TB has 40 gates in the critical path. Process shrink makes this worse because it increases the relative latency - and a cache can only improve average case latency at the expense of worst case latency.

An absurd hypothetical example would be to place a 6502 directly adjacent to 1GB RAM. If these two units are interfaced with a trivial bank switching scheme then it is possible to take advantage of static column paging and pre-decode all address bits except the least significant 16. In this case, the latency of the ALU and memory are approximately balanced. However, if the two units are interfaced with a Commodore style bank switching scheme (where program and data accesses are automatically directed to separate banks) then all address bits may change with each access. In this case, the processor would have to be slowed to match the latency of the memory. It is possible to provide two sets of decode via dual port RAM. However, due to economies of scale, it is more cost effective to provide separate banks for program and data. This is commonly known as a Harvard architecture and has numerous disadvantages.

There is a qualitative difference when moving from 1KB to 1MB to 1GB to 1TB. One of these differences is that we can be more relaxed about the critical path of logic gates. This manifests as wider registers, more numerous registers and instructions such as multiplication. Obviously, this leads into Amdahl's law of diminishing returns. But don't act like 8 bit computing is the paragon of efficiency. Efficiency per transistor has been declining since Intel released the 4004 processor architecture. ("Four bits is enough to hold a decimal digit and much more besides! But you profligate whippersnappers use a whole eight bits when you only need one or two! If you keep wasting memory like this, I'll have to buy a whole 1024 bits!")

Prefix Instruction Versus Mode Bit

More seriously, a consideration of Amdahl's law has softened my stance against the 65816's modal bits. My first impression was that an extra two bits of state when entering and exiting subroutines added a large amount of cognitive load. It hadn't occurred to me that an application may be written in a style where there is a default setting. For example, the accumulator may default to 16 bit mode and exceptions may be bookended with mode flips. Alternatively, an application may default to don't care and may be explicitly set before each use. That style of use is a net positive. In particular, it aids brevity without sacrificing throughput. However, several misgivings remain. In particular, some of the instruction density is gained through ambiguous encoding. People have exploited the differences between x86_32 and x86_64 to evade malware filters. I'm working on 6502_64 so that we can sidestep this type of technical debt.

An unambiguous encoding necessarily has a lower instruction density. While I also have misgivings about prefix instructions, there is an advantage to a stateless encoding where the size of the prefix plus opcode never exceeds the size of the data. In the case where accumulator width matches the width of the external data bus, an implementation may read two or more instructions concurrently. In the most optimistic case, where alignment is ideal, one of these is a prefix and one of these is an opcode. Essentially, 8 bit data should handled with 8 bit opcodes. Larger data should be handled with larger instructions. This is downward compatible to systems with the most narrow data bus. It also provides an advantage when handling small pieces of data. For example, the latency of 8 bit addition is less than the latency of 64 bit multiplication. Therefore, there may be an advantage to fit more 8 bit addition instructions into a hypothetical cache line.

An individual mode bit is great because it allows two contradictory semantics. However, mode bits tend to hang about in gangs. This leads to unbounded complexity. I believe that Edson de Castro's objection to mode bits was the cost of developing, testing and distributing microcode which would be bank switched by one or more mode bit inputs. Invariably, one mode will be used more frequently than another. However, making this stateful is only an advantage if one mode is the most common. For example, when using a 65816, it is an advantage to default to 16 bit operation. However, when handling 8 bit data, 16 bit data, 32 bit data, 64 bit data and possibly larger pieces of data, it is conceivable that no data size is used for the majority of operations. Effectively, there is no good default. Also, starting from 65816, which defines 254 opcodes in detail and the remaining two in outline, how would I specify 32 bit operation? Conceivably, I could define a small subset of the WDM prefix specify 8/16, 32/64, 128/256 or 512/1024 operation and then use the existing mode bits to specify finer detail. Or perhaps I could specify accumulator and/or index size concurrently and use the legacy short encodings to switch down to shorter sizes?

Actually, I've considered multiple definitions for the 65816's WDM prefix; some whimsical and some facetious. The best encoding that I've devised so far is for the WDM prefix to precede two 12 bit ARM Thumb instructions but redefine the Thumb 8/32 bit operations as 32/64 bit operations. This would allow short instructions which operate on the traditional register set and default to a stateful 8/16 bit encoding. It also allows longer instructions to statelessly specify 32/64 bit encoding and operate on an extended set of registers. It also retains 65C02 and 65816 binary compatibility while approaching the instruction density of Thumb.

32/64 Bit 6502 With Binary Compatibility

There is suggestion that a 32 bit extension of 6502 does not have to be binary compatible. There is also suggestion that it is called ARMv1. This is the problem with 6502 architecture extension. It either has to extend/emulate/ignore the established 16 bit extension or it is yet another 32/64 bit stack/RISC chip - and competition is steep among RISC chips. I reversed an arbitrary design into 6502 and from that exercise, I've found that it is also possible to reverse various extensions of ARM and RISC-V into various extensions of 6502. The trick is to find a mapping of registers and opcodes. This is particularly easy if it is possible to define a hybrid instruction stream. The unusual history of 6502 made a notably sparse instruction set. I am unable to name any other general purpose processor architecture which did not define less than 2/3 of the instruction-space in the first iteration. Likewise, I am unable to name any other general purpose processor architecture which does not have a contiguous, undefined 1/4 of the instruction-space. This makes 6502 architecture extension particularly easy in a manner which does not apply elsewhere. Likewise, the short and long encodings of ARM and RISC-V provide encodings which do not exceed 1/4 of the instruction-space.

By default, ARM has four byte instructions. From ARMv1 to ARMv7, four bits of every instruction are a conditional execution field. These conditions are defined in pairs, such zero/not zero. One pair is always/never. "Never" is used for a 12 bit short encoding called Thumb. With some bit shuffling this can be mapped into 1/16 of the 6502's instruction-space. With the benefit of hindsight, RISC-V uses a two bit field to specify one 30 bit encoding or three 14 bit encodings. Therefore, RISC-V has 12 times as many short encodings compared to ARM Thumb. Regardless, RISC-V's long encoding can be mapped into a contiguous 1/4 of a 6502's instruction-space. Specifically, it fits into Group 3, Group 7, Group B and Group F. Using a hypothetical two modal bits and a modicum of bit shuffling, it is possible to execute an instruction stream which consists of:-

6502 instructions only.
6502 instructions and ARM short instructions.
6502 instructions and RISC-V long instructions.
ARM short instructions and RISC-V long instructions.

Without mode bits, it is also possible to execute 65816 instructions and pairs of ARM short instructions. I don't recommend any of this an a practical exercise. However, if two or three hardware definitions are spliced together and an assembler is similarly spliced, the unholy amalgamation will execute the output of GCC, LLVM and TinyC. I mention this for multiple reasons. Primarily, it definitively answers a few questions. Yes, it is trivial to make a 32 bit extension of 6502. In particular, it is possible to have 6502 binary compatibility within ARM. Arguably, it has always been possible to maintain 6502 binary compatibility within ARM. No, it would not have been commercially successful.

There are further reasons to mention a 6502/ARM/RISC-V hybrid. If you cannot exceed this standard then don't bother. Like the video decode problem, the acceptable threshold increases every year or so. However, the quest for single threaded performance has led to designs where an increasing cluster of units run autonomously within their own light-radius. This includes multiple integer units, FPU, MMU, SIMD, crypto and GPU. Most ridiculously, this leads to duplicate circuitry and shadow registers to model the behavior of circuits running within the same CPU. I've found that it is possible to unify most of this cluster into one unit. The trick is to treat each unit as a separate architecture and find a mapping of registers and opcodes. In this case, the primary constraint is register width.

Maximum Number Of Registers

I hope that I've given a sound theoretical basis for a 32 bit processor to have multiple symmetric registers. There is also a theoretical basis for an upper bound which never exceeds eight registers. This is derived from the graph coloring problem. Technically, the British spelling should be used because it was first applied to the geographical divisions of England. However, I'm not a pedant.

There are numerous mathematical problems which have been unresolved for centuries. For example, Fermat's Last Theorem and the Canonball Conjecture (both solved by Andrew Wiles). Well, the graph coloring problem was solved in 1976 and published in 1977. That's incredibly poor timing because it occurred during the development of 6502 and Z80. It was also quite controversial because it was a computer assisted proof. The graph coloring proof states that contiguous, two dimensional regions on a plane never require more than four colors for them to be distinct. This is very, very important for programming because we never have to handle more than four intermediate values. How does that work? All unstructured programs can be implemented as block structured programs. In the worst case, unstructured bytecode can be executed with a block structured interpreter. All block structured programs can be drawn as a state machine on a two dimensional plane without any arrows crossing. (Calls between functions can be implemented with a dispatcher in the shape of a binary tree and all types on stack are fungible.) Given Appel, Kenneth; Haken, Wolfgang (1977), Every Planar Map is Four Colorable. I. Discharging, Illinois Journal of Mathematics, a state machine on a two dimensional plane without crossing arrows is four colorable.

People argue about the merits of stack, single accumulator or more. Here we have a definitive answer. A statement written in infix notation with precedence rules, such as l=x*x+y*y, can always be reduced to stack operations without ambiguity. Furthermore, where the precision of a value exceeds the precision of an accumulator, the same stack can be used to hold intermediate chunks of data. So, one accumulator, one index register and one stack are sufficient to process one statement. Unfortunately, this is only one node of the state machine. An algorithm with multiple statements may require juggling four or more intermediate values. This can always be reduced to a maximum of four intermediate values. However, for brevity and clarity, or due to implementation quirks, it is not necessary to strictly adhere to this limit.

People disagree over this less frequent but essential case. Multiple registers and stack introspection both allow juggling of four or more intermediate values. Unfortunately, the threshold where each technique is most efficient may not be immediately apparent. For example, it may depend upon the efficiency of a given compiler and given algorithm, connectivity of register transfers, the width of the external bus, cycle time of main memory or the bit patterns of the opcodes used to select functions in a given ALU implementation. However, if addressing is 32 bit or larger then it is overwhelming likely that multiple registers are beneficial.

Even if you completely disregard the graph coloring proof and believe that five is the minimum, exceptions would be sufficiently rare to make stack operations preferable to the inefficiency of doubling registers. Unfortunately, there are some practicalities around indexing and stacks. Arbitrary data is sourced from main memory and modifies one or more accumulators. If these operations are conceptually 3-address operations applied to four accumulators then it is beneficial to provide an additional two registers. These registers would be general index registers into a common data segment or stack pointers into separate conceptual segments. However, the result may not be practical due to poor instruction density. Instruction density can be increased by adding deliberate asymmetry. Asymmetry may arise from 8086 source compatibility with 8080. For brevity of specification, correctness and throughput, binary compatibility with 6502 may be preferable. Asymmetry may require stack operations or additional registers to shuffle data. This requires seven or more general registers. Fortunately, register pressure can be reduced by not making stack pointers or program counter part of the general register file.

Stack pointers can be incremented and decremented without transfer to accumulator. Likewise, intermediate addresses can be computed in an accumulator and then transferred unidirectionally to an index register. Overall, a 3-address machine reaches an efficiency threshold when it has four accumulators, two index/temp registers and one or more stacks. A system with less registers may allow them to be used in multiple modes. For example, a 6502 index register is often used as a temporary register or stack pointer. This intentional conflation reduces cognitive burden, allows a parsimony of implementation and improves instruction density. Unfortunately, there are cases, which may occur once per statement or less frequently, in which this is a false economy. This is particularly true when considering transistor count of a processor and memory.

The ability to handle four or more intermediate values explains much of the lunacy in processor design. Notable examples include the 16 element stack window of TMS9900, register windows in SPARC, the conveyor belt concept of the Mill processor architecture and Intel's Itanium which had more than 2KB of state to context switch. It also explains why processor architectures take multiple positions. For example, the general register count in AVR and RISC-V may differ by a factor of two. Most significantly, ARM Thumb restricts access to eight general registers. There are many reasons for specifying too many registers. The most obvious case is a surplus of bits; typically to achieve 8 bit, 16 bit or 32 bit alignment. ("Hey! We've got a spare bit! Let's have two accumulators and two index registers!") This is why general purpose processor architectures always specify 1/2 of the instruction-space and typically specify more than 2/3. A confounding factor was MIL-STD-1750A which specified a processor with 16 registers. RCA1802 has 16 registers. MC68000 has 16±1 registers and a version was developed with IBM System 370 microcode, which also has 16 registers. SWEET16 has 16 registers where special registers, such as program counter and stack, backfill from R15. Many of the RISC designs influenced directly and indirectly by SWEET16 also have 16 registers and backfill from R15. This includes ARM.

If you want a futile exercise, try designing a symmetric, 3-address, 16 register processor architecture which uses 8 bit opcodes. It can be achieved by splitting the opcode into two or three fields and heavily prefixing the instructions. This includes defaulting to MOV and then prefixing any desired ALU operation. Understandably, the instruction density is terrible, in part due to the combinatorial explosion of prefix instructions. So, how have systems with 8 bit opcodes, such as x86, been sold as MIL-STD-1750A? Easy. Ignore the specification or get a waiver. Indeed. My results with processor design advanced considerably after I ignored MIL-STD-1750A.

My Processor Architecture

My first serious design was inspired by 6502, 68000, 88000 and some work with a quadtree video codec similar to HEVC. Specifically, the 2:14 split of STA:ALU operations in 6502 opcode Group 6 and Group D (ORA, AND, EOR, ADC, *STA*, LDA, CMP, SBC), 68000 symmetrical 2-address encoding and quadtree motion delta encoding inspired a 16 bit instruction format in which four nybbles may represent opcode or operand. In the preferred implementation, hexadecimal digits 0-B represent operand and hexadecimal digits C-F represent opcode. As an example, 0xF012, 0x0F12, 0x01F2 and 0x012F are distinct instructions which operate on R0, R1 and R2. In addition to aiding debug, this arrangement provides a practical quantity of 0-, 1-, 2- and 3-address instructions. It also provides a four operand, zero opcode encoding. I was quite pleased with this fusion of ideas. Despite being complete vaporware, it served as a basis to investigate numerous concepts including allocation of instructions in proportion to frequency of execution, constants in the instruction stream, illegal instructions, unaligned memory access using two separate instructions, multiple data stacks, stacks aligned with register width, biased stack pops, biased branches, atomic operations, SIMD, fused multiply-accumulate and fused compare-and-branch for the elimination of all user flags. (This was in ignorance of the MIPS technique of evaluating logical predicates.) I found that I could subdivide the fields into two inputs and two outputs (although it is most convenient for the fused compare-and-branch to use two "outputs"). In my enthusiasm, I would often define two or more functions to the same bit pattern. However, I have developed techniques to prevent duplicate allocation. The design was intended to minimize multiplexing and the latency of TTL loads. Indeed, I hoped to make a scalable 16/32/64 bit design using one, two or four identical boards with 3*32 pin connectors. (This is why I have such admiration for the MyCPU project with its five board CPU and 3*32 pin connectors.)

The ultimate failure of this design was poor instruction density. The pure load/store architecture allowed commutative operations to be reduced. This applies to ALU operations and conditional branches. However, I was unable to retain a balanced design and specify more than 6 bits of constant per 16 bit instruction. This is worse than the notoriously poor constants handling of a Transputer. Simultaneously, the mapping of opcodes from 6502 and 68000 was unfavorable. A further complication was the unconventional 4-address instruction which consumed 81/256 (31.6%) of the 16 bit instruction-space. Unfortunately, I could not find an operation which justified such allocation. I considered multiplication with 64 bit input and 128 bit output (with possible modal bits for signed/unsigned operation). I considered multiplication of the form R0=R1*R2+R3. I considered double MOV, line draw and bit rotation. Assuming such problems were solved, my vaporware would require five or more data paths to 64 bit registers. Whereas, Xtensa only requires three data paths to 512 bit registers. It is a terrible state of affairs when a person's vaporware is inferior to a shipping product.

My second serious design attempted to correct previous failings and also outperform ARM. I thought about this problem for too many years. One of the problems with processor design is that choices are relatively unconstrained. No individual choice is bad or fatal but if you say yes to everything then the result is an Intel Itanium. Never go full Itanium. Anyhow, the most significant change to my design was a more conventional arrangement of instruction fields encapsulated within a variable length instruction encoding. An instruction-space should be regarded as a binary fraction rather than a finite block. Eventually, I discovered that it is possible to specify 7 bit, 14 bit, 21 bit, 28 bit or larger instructions such that:-

A maximum of 1/2 of the instruction-space may be specified for move instructions.
A maximum of 1/4 of the instruction-space may be specified for jump and branch instructions.
A maximum of 1/128 of the instruction-space may be specified for privileged instructions.
One byte instructions may be 8 bit, single or dual accumulator operations without post-increment.
Two byte instructions may be 8/16 bit, register file operations with post-increment.
Three byte instructions may be 8/16/32/64 bit, integer/float, 3-address operations with post-increment and fused accumulate.
Five byte instructions may be 128 bit operations or larger.
Longer instructions may use Fibonacci lengths and therefore six byte instructions may be disallowed.
Commutative operations may share opcodes.
All relevent instructions may have automatic SIMD.
Control flow and privileged instructions may be one contiguous group of opcodes. This provides more options for scaled-out implementations.
Opcodes may be otherwise patterned to facilitate software compression, FPGA implementation or parallel simulation on GPU.
A breakpoint instruction may scale to the size of any other instruction.

This looks competitive with 8 bit instruction sets, 16 bit instruction sets and RISC; in particular ARM, RISC-V and Xtensa. I implemented this is C and ran it on a bottom-of-the-range Raspberry Pi (700MHz Broadcom ARMv6). Results were mixed.

A design loosely inspired by 6502 and ARM running on ARM should run quite fast. Unfortunately, from fuzz testing and benchmarking, I found that instruction execution required an average of 134 clock cycles. That was unexpectedly slow. Some of it is instruction cache miss. Much of it is derived from promoting all instructions to 28 bit representation and then performing a large number of fancy-pants tests which are typically false. Despite the inefficiency of, for example, 21 bit left shift on ARM, this virtual machine (with 0.7% efficiency on a 700MHz ARM credit card computer) executes double precision floating point addition faster than an 8MHz 68000 executes 16 bit integer addition. However, it is undeniable that my instruction format is the major impediment to this process. I got too smart. I've gone from one extreme to the opposite extreme and neither is efficient.

For my third serious design, I'm now replicating AMD's trick of incorporating 29050 (RISC) into K5 (x86). However, in my case, I'm taking the best parts of my previous designs and incorporating them into 6502. This process retains eight 64 bit registers, 3-address operations, fused multiply-accumulate, data stack, pre-scaled branches and it retains similar instruction density. Losses are minor. This includes arbitrarily deep nesting of exceptions and interrupts (which were tied to SIGUSR1 and SIGUSR2), Fibonacci lengths, post-increment, LEA [Load Effective Address], privileged instructions, general fused accumulate and my peculiar fused compare-and-branch. Gains are significant. This includes the speed of 8 bit instruction decode, zero page and byte-by-byte arbitrary precision arithmetic in binary or decimal. Two accumulators and six asymmetric index/temp registers isn't perfect but it should be sufficient for common cases. Most significantly, core functionality will work with existing compilers, operating systems and applications.

You may be surprised that I have discarded a privilege system. In an arbitrary design, it is possible to meet the Popek and Goldberg virtualization requirements by having a clear division of privileged operations. For example, in 68010, the tweaked version of the 68000 which meets such requirements, the flag register is split into an 8 bit unprivileged section and an 8 bit privileged section. Attempts to read or write the privileged section may be halted or intercepted. In an extended 6502 emulator, there is ultimately a modal dialect of one or more bits. However we never require an explicit privilege bit because:-

Dialect specifies opcodes.
Opcodes specify register visibility.
Dialect plus opcode specifies addressing mode.
Addressing mode specifies address-space.

Therefore, we have cases which may include:-

6502 binary may run in one 64KB segment on a 64KB boundary. Importantly for emulation speed, this requires no bound check.
6502 binary may run in multiple 64KB segments. For example, program and data may be split. This requires no bound check.
65816 binary may run in one 16MB segment on a 16MB boundary. This requires no bound check.
6502 binary with 64 bit RISC extensions may be restricted to 16 bit addressing. This requires no bound check.
6502 binary with 64 bit RISC extensions may be restricted to 24 bit addressing. This requires no bound check.
6502 binary with 64 bit RISC extensions may be restricted to 32 bit addressing. This requires no bound check.
6502 binary with 64 bit RISC extensions may have unrestricted access to memory and I/O. This requires no bound check.
All shared binaries may be placed in one contiguous, growable, read-only segment. This requires coarse bound check prior to indirect jump.

None this requires an explicit privilege system. Regardless, it is possible to have text or graphical applications which are protected from each other while also having one or more high bandwidth applications, such as a video player. The intention is to have the stability of 8086 Protected Mode and the creative anarchy of Amiga Workbench. If the anarchy becomes unmanageable then it is possible to wrap everything inside a 32 bit, 40 bit or 48 bit address-space. Be warned!

Implementation

The intention is to make the project affordable and accessible. Therefore, anything more niche than Arduino is not a priority.

The minimum implementation is software only running from POSIX shell. (I investigated the possibility of implementing a virtual UART to XWindows. Unfortunately, raw XWindows protocol is more difficult than six dialects of 6502.)
The software desktop implementation should compile without POSIX and run raw on a microcontroller. Initial targets are ARM, AVR and 8051. In particular, WCH's CH330 - the infamous 3¢ microcontroller. It might be heresy to suggest this but 8 bit 6502 can be used to implement 64 bit 6502. I have devised a state machine and arrangement of 74 series chips which allows 11 or 12 pins to access an external memory bank. It was initially devised for AVR to run 6502 in a 16MB address-space or larger. However, it could equally be connected to 6522 or FPGA running as stock 6502 plus 6522. Or used for an entirely different purpose.
While hunting a bug, I inadvertently wrote enough library to replace Arduino digital I/O. Therefore, my AVR software is 100% Arduino compatible, with or without the official Arduino LGPL library. Furthermore, my version is smaller, faster and MISRA compliant. Anyhow, minimum hardware cost is USD5 for clone Arduino Nano, optional USD7 to program a blank Nano via SPI and USD20 for components.
I hope to have USB and/or SPI AVR in a 6502 socket with jumpers or variants for 6510 and 65816. I believe it is reasonable to price this below the cost of an official Arduino Nano. I am extremely willing to outsource design and manufacture if I recoup expenditure, plus or minus USD100 or so. For anyone who has an undemanding application - or conversely - for anyone who desires a custom opcode, this may be preferable to genuine 6502 hardware or FPGA.
In addition to AVR or similar being used in a legacy 8 bit host, identical firmware may be used in the motherboard of a standalone card system. In particular, a One Time Programmable microcontroller with 6502 emulation may be paired with a reflashable boot ROM which cannot be updated from the booted environment.
C implementation facilitates CUDA implementation.
C with no forward references facilitates port to VHDL.

Anyhow, at present:-

150KB C with all possible warnings enabled produces clean compile on GCC for ARMv6 and Atmel's fork of GCC for AVR.
Static analysis indicates that nested macros are too long. That tends to happen when specifying a processor.
Leading parameters on all major functions share name and type. This greatly reduces parameter shuffling. For example, I wish to avoid circumstances where foo(a,b,c) calls bar(a,c). In this case, it is typical to push b on stack, copy c, call bar(a,c), return and pop b. This is worsened by return types which may differ in size.
Debug build has POSIX printf() statements after executing each instruction.
Vectors are not implemented and therefore I haven't tied POSIX SIGUSR1 and SIGUSR2 to NMI and IRQ. (From previous work, it is great fun to single-step, invoke a POSIX signal and watch a nested interrupt.)
All instructions are restartable and therefore all implemented opcodes would correctly handle bus error if it had a working vector.
I believe that I have defined a case statement for every official opcode of NMOS 6502, CMOS 6502, 65816 and 65CE02. Common unofficial opcodes, such as LAX, are implemented. Block copy is unimplemented.
Some flags are present on some opcodes. This hasn't been a priority because the legacy instructions may be swapped for another implementation.
Common opcodes produce sane output. The remainder is not a priority because I'm exploring other functionality.
Addressing modes are largely untested. In previous designs, I've had numerous errors with stack and therefore this part is tested more thoroughly.
Macros for SIMD correctly generate rolled or unrolled loops according to compiler parameters. This allows a smaller but slower implementation.
Extensive use of #ifdef allows 8/16/32/64 bit SIMD to be excluded from build. Likewise for unused opcodes and addressing modes. (64 bit registers may be present but dormant.)
SIMD is applied wherever it has been tested. Unfortunately, disassembly indicates that automatic SIMD everywhere is too slow. Therefore, I will be changing implementation such that unprefixed legacy opcodes only update the bottom 8 bits of a register. (Or 16 bits, maybe, for 65816.) I hoped to have full symmetry rather than an 8:56 or 8:8:48 bit split in the registers. If my first choice was to use an 8:8:16 bit split then I'd use 68000.

Unfortunately, avr-gcc generates horrendous code. For example, assume 32 bit variables and a reasonable compiler optimization setting. In this case, something akin to (0xFFFF0000L & segment) | (0xFFFF & (reg[2] + abs)) would be expected to generate 16 bit addition and 16 bit move. Unfortunately not. Indeed, the most pressing insight I've gained from disassembling the output of 8 bit C compilers is that every variable and intermediate value should be tagged with min, max, offset and stride. That would be sufficient to identify 16 bit addition and eliminate all logical operations. Instead, I have to restate all similar code in manner which does not introduce endian problems. (I used logical expressions to hide endian problems while allowing optimization of 24 bit masks.)

Interrupt Handling

I'm concerned about interrupt speed. This is unrelated to register count, synchronous block copy or hardware divide (integer or float). Wider vectors will slow interrupt vector read at legacy frequency and data bus width. Assuming data bus width is half the vector width, this is reduced to the familiar two memory read cycles. After that, it falls behind. Following a wide vector implies that execution is not in an 8 bit mode (or 65816 mode). In the trivial case, it should be possible to indirect via 0xFFFC and switch down to a 16 bit or 24 bit address-space; possibly automatically based upon address range. For any data bus width, this requires more main memory reads cycles than 6502. It also requires a dedicated interrupt stack. Otherwise, an attempt to push program counter and interrupt flags onto user stack may incur page fault and initiate a nested interrupt. This only worsens interrupt response time and jitter.

It is possible to make interrupt handling modal. For example, interrupts may default to 65C02 mode. From here, interrupt handling could be modified to the slightly expanded 65816 mode. This is preferable for a legacy system or minimal system because it maintains maximum speed and only requires one ROM within a legacy address-space. However, this arrangement may not provide any mechanism to break out of legacy modes. It may be possible to default to 64 bit vectors from a base address outside of the legacy 64KB/16MB address-space and then statelessly or statefully switch down to legacy vectors (where mappings exist). In some cases, partial decoding of a small ROM would be required to direct interrupts into the legacy address-space. However, we have the opposite case when the bottom 16MB is not mapped to a legacy host. In this case, we lose a maximum of 1/64 of larger address-space. Ordinarily, it would be 16MB out of 4GB. This is 1/256. However, a flexible mapping would be:-

0x00000000 to 0x00FFFFFF: Modal registers which are invisible in most modes of operation. This is beneficial for emulation speed and hardware implementation.
0x01000000 to 0x0100FFFF: Optional legacy 6502 address-space.
0x01000000 to 0x01FFFFFF: Optional legacy 65816 address-space.
0x02000000 to 0x03FFFFFF: 64 bit vector base and boot firmware.
0x04000000: First suitable address for RAM is 64MB.
0x20000000: First unshadowed address suitable for DDR3 RAM may be 512MB or higher.

For systems with 64MB RAM or less, addressing should begin at 0x04000000. In other arrangements, RAM may begin at a small set of alternative addresses or a maximum of 64MB is overshadowed. This is minor compared to x86 where it is typical to lose 3/8 of a 32 bit address-space to PCI mapping and therefore up to 1.5GB RAM may be unavailable to 32 bit software. Regardless, mutually incompatible modes of operation are satisfied with modal registers in a chunk of address-space. This is not ideal but mode register state meets the Popek and Goldberg virtualization requirements without requiring special opcodes. Furthermore, this practice may be nested. For example, it is possible to place one or more instances within a 40 bit address-space on 4GB boundaries.

After placing programs in containers and denying access to I/O, programs will require communication with an operating system and possibly each other. I'd like the simplicity and stability of a program which runs within a 64KB address-space. However, I regularly edit text which is larger than 64KB. These are mutually conflicting requirements. Fortunately, it is possible to satisfy both requirements by providing an in-memory key/value store. In this case, individual lines of text may be stored external to a program. This technique increases memory utilization without incurring long-term memory leak or compromising integrity. It can also be used to selectively share information. For example, to implement clipboard functionality.

This and other functionality requires a fast, portable interface which ideally allows multiple operations per call. This is most likely to require handling of the BRK opcode in a manner which differs by execution address or trapping indirection via 16 bit vectors which also differ by address. It does not help that both mechanisms are already borked. BRK is opcode zero because IRQ was originally implemented by clearing any opcode and sharing the same vector. Call conventions to overcome simultaneous execution of BRK and IRQ require flags and a loop around BRK to ensure that IRQ did not take precedence. 65816 separates BRK and IRQ vectors but the vector addresses are mutually incompatible with Acorn operating system entry points.

Full compatibility in all modes of operation might be attained by providing vectors which are not claimed by Acorn, WDC or other parties. The content of each vector would be its own 16 bit address or a special value. Indirection to such address is error but this is sufficient to invoke context switch. From here, I recommend a minimum of two system calls. The first system call would mediate resources from a kernel, such as memory allocation and yielding processing power. The second system call is for state which is visible beyond the kernel; namely I/O. This includes streams, logging, progress statistics, text console, graphical interface, files and network. For POSIX compatibility, timers would be conflated with file and network access. I further recommend that all parameters are passed on stack. Although, parameters may be little more than the system call number and the address of an array of pointers. It is trivial to specify system call number in Register X. However, should it be prescaled by two? More? And what happens if there is a native SWEET16 mode and we don't have Register X? Or a native stack mode and there are no general registers? Stack parameters are the most portable and - contrary to intuition - fastest for system calls. However, stack alignment problems remain.

Anyhow, at present, I'm mired by interrupt handling. I'd like it to be fast, portable and 100% downward compatible. I also wish to avoid known problems without adding too much code or circuitry.

Attachments:

latch0-0-3.odg [35.69 KiB]
Downloaded 86 times

hypercube0-0-2.odg [41.23 KiB]
Downloaded 107 times

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!

Top

BigEd

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Mon Sep 07, 2020 9:04 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England

Well, that's quite the second post! I will confess, I have only skimmed it. What's most important, I think, is that you're having fun. The deeper you go down the rabbit hole, the fewer people will be following closely, I suspect. However, audience size is only one measure.

I'm glad to see you've looked into some of the historical choices made. Computer architecture is a massive design space, ultimately you have to make some choices, but it's surely good to have some appreciation of the landscape.

As to your opening question, why so many false starts and so few great successes, I think it's because of the gulf between idea and execution. Anyone can have an idea! The grander the idea, the more difficult the execution - see the difference between 32016 and ARM for example, as practical prospects. You're quite right that successful execution of an architectural idea involves more of an ecosystem than just the CPU design (which is itself rather more than just the filling out of an instruction encoding.)

In my forays into this space, I'd say that the first thing needed is the emulator, and the second is the assembler. Then the HDL, the simulation, and the experimentation with microarchitecture. If your design isn't super simple, you'll need a testsuite. For some trajectories you'll need a compiler and a debugger. You'll need a monitor and a loader. If you hope to productise and have commercial success, you'll need to do teamwork, collaboration, negotiation, organisation. You'll need to be able to convince people that you can do what you claim, and will fulfil your promises. It's a very tall order to have all these skills covered in just one person. You'll probably need some luck, and good communication skills.

You might be interested in the various outcomes of the One Page Computing Challenge, which was all about doing something interesting and fun with an extra constraint of very limited source code size. We made some RISCy processors hooked up to a BBC Micro and with a BCPL compiler, among other things.

Agreed that Gigatron was, and is, brilliant. It adopted a constraint of TTL implementation and a simple transparent architecture, with no hint of compatibility with any other machine, but has proved very flexible and capable. I don't know how many have been shipped, but it's a significant achievement. The many and various efforts which have done something by way of extending or replicating popular platforms, with a high degree of compatibility, have an even harder journey: they need licensing, a story about volume production, capital. Spectrum Next has been (and is presently) very impressive, for having made a new and extended platform, with a stock of new software (games) being written for the platform. It was a project very long in the gestation and still is uncapitalised, but will have sold in the thousands.

My own preference, for fun, is not to worry too much about compatibility or at all about commercial success. It's worth figuring out what your own preferences are - and teaming up with people who share them.

Top

White Flame

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Mon Sep 07, 2020 10:00 pm

Joined: Tue Jul 24, 2012 2:27 am
Posts: 674

Sheep64 wrote:

Most significantly, I've been told that the C64 Direct-to-TV has dormant 32 bit functionality. This is entirely plausible given that Jeri Ellsworth's previous work includes a multi-system emulator. Can anyone confirm this? Or is it the seed of a rumor of a clock cycle accurate 6502 with 32 bit extensions? Also, don't stop when you're winning! Why are there no sequels to this product? Jeri Ellsworth appears to be working on an augmented reality project. Perhaps the low energy requirement of wearable computing suggests a hacked-up, 32 bit, inspired-by-6502 design with FPU extensions?

The DTV can do SDRAM bursts of 4 8-bit(?) reads to execute serial-access instructions faster. Certainly it might hold a 32-bit buffer internally for that, but I've not heard of any 32-bit CPU extensions, nor seen anything related emerge in the demoscene. By "dormant", do you mean actually implemented and accessible for use but just undocumented? Because it could also imply that partial work was done in that direction without actually being exposed/finalized. Or maybe it's the just that the blitter that can do 32-bit bursts as well?

I think the business folks behind the DTV went on to do the TheC64 Mini, but I might be wrong. And of course, as an emulation box that project bears no resemblance to the DTV tech now, even if in the beginning they promised C64 hardware port compatibility.

Quote:

I am a fan of stack designs and single accumulator designs. However, it is undeniable that such designs get mired when they are process shrunk and/or memory grows. Why does this occur?

Probably because out-of-order architecture seems to be much more difficult with stacks than it is with registers. Stack operations all technically have serial dependencies with each other because of the implicit shared mutable state of the stack, whereas registers exist in parallel.

Quote:

However, when handling 8 bit data, 16 bit data, 32 bit data, 64 bit data and possibly larger pieces of data, it is conceivable that no data size is used for the majority of operations. Effectively, there is no good default. Also, starting from 65816, which defines 254 opcodes in detail and the remaining two in outline, how would I specify 32 bit operation? Conceivably, I could define a small subset of the WDM prefix specify 8/16, 32/64, 128/256 or 512/1024 operation and then use the existing mode bits to specify finer detail. Or perhaps I could specify accumulator and/or index size concurrently and use the legacy short encodings to switch down to shorter sizes?

If the main purpose is to perform vector operations, then the gains of running in parallel overwhelm any additional cost of multi-byte prefixing, so I wouldn't worry that much about length there, and keep the non-vectored stuff short. Of course, this all depends on where the most time is spent in vector processing loops: 1) decoding non-vector operations, 2) decoding vector operations, or 3) running instruction bodies including the vectored ones. Presumably, non-vectored instructions far outnumber the vectored ones, so density would favor larger vectored ones as well.

Quote:

If you want a futile exercise, try designing a symmetric, 3-address, 16 register processor architecture which uses 8 bit opcodes. It can be achieved by splitting the opcode into two or three fields and heavily prefixing the instructions. This includes defaulting to MOV and then prefixing any desired ALU operation. Understandably, the instruction density is terrible, in part due to the combinatorial explosion of prefix instructions.

You might be interested in the VCFMW talk I gave last year about the principles, design, & tradeoffs behind my AcheronVM project. It's basically an orthogonal 2-address, N-register, 8-bit opcode virtual 16-bit CPU with a strong focus on density. It's a software implementation on top of the 6502, so has different tradeoffs than a hardware or C-implemented model would. Reducing the number of operands takes a significant burden off of 6502-implemented dispatch & decode, which I define as pure overhead as it's not the actual intentional work of the instruction itself. Most instructions are only dozens of 6502 clock cycles including all dispatch & decode, and density is high enough that I don't need to bother with most bit-packing.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor

Top

dmsc

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Tue Sep 08, 2020 12:22 am

Joined: Mon Sep 17, 2018 2:39 am
Posts: 136

Hi!

A very detailed post!

Some (random) coments:

Sheep64 wrote:

Perhaps all of the above? I like the 6502 because it is simple and available in many retro-computers. But, if I would have an extension, I would prefer to build a RISC-V machine nowadays

Even so, I think that the only valid reason for doing your 6502 extension is that you are having fun!

Quote:

The first pair of contradictory requirements is fast interrupts and atomic block copy.

The way to solve this is to have the "copy count" in registers, so you can interrupt the copy and resume it later simply executing the same copy instruction again - think on the x86 "rep movs" string instructions. Now, obviously this is not atomic, but if you really need the copy to be atomic you can set a CPU flag when a copy is in progress, then you can detect this in the interrupt handler and act accordingly.

Quote:

For a very easy and powerful way to implement atomic instructions, look at the https://en.wikipedia.org/wiki/Load-link ... onditional supported in many RISC architectures, for example the ARM "LDREX" and "STREX".

The idea is that on the load you set the "exclusive access" to the memory location, and then the store only works if no other CPU or thread accessed that location. If not, you detect that and jump back to retry the operation.

Implementing those on uniprocessors is trivial: on an load-link instruction you set a new CPU flag to 1, and on the store-conditional you only perform the store if the flag is 1 and copy the value of the flag into the Z flag. On any interrupt, you simply set the new CPU flag to 0.

On multiprocessors the implementation is more difficult, but any multi-processor bus lock procedure would also be difficult.

Quote:

One of the most obvious failings of a symmetric register file is that 2^R registers adds R logic gates to the critical path.

Even if that is true, in practice those logic gate delays can be hidden because you also need to decode the ALU operation, and the source and destination register decode happens in parallel. So, your critical path is the longest of the source-register select, target-register select and ALU select.

The ARM1 has all of those about the same length, and also added the barrel-shift decoding in parallel (another 32 input decoder), so the timings are really balanced.

In the 6502 instead, the critical path is in the ALU (and in particular, in the decimal-mode ALU), making the decode more unbalanced.

Quote:

The best encoding that I've devised so far is for the WDM prefix to precede two 12 bit ARM Thumb instructions but redefine the Thumb 8/32 bit operations as 32/64 bit operations.

ARM thumb instructions are 16bit length.

Quote:

By default, ARM has four byte instructions. From ARMv1 to ARMv7, four bits of every instruction are a conditional execution field. These conditions are defined in pairs, such zero/not zero. One pair is always/never. "Never" is used for a 12 bit short encoding called Thumb.

That is not correct. The original thumb instruction encoding uses the full 16 bit instruction word, and the CPU has a flag that signals the current mode. On newer ARMv7 there is also some thumb instructions that are 32 bit, bit the encoding is different than ARM instructions. Also, thumb instructions don't include the conditional bits, those are assumed as always execute (or given by the IT "if-then" instruction for a block of 4 instructions following).

Quote:

A design loosely inspired by 6502 and ARM running on ARM should run quite fast. Unfortunately, from fuzz testing and benchmarking, I found that instruction execution required an average of 134 clock cycles. That was unexpectedly slow. Some of it is instruction cache miss. Much of it is derived from promoting all instructions to 28 bit representation and then performing a large number of fancy-pants tests which are typically false. Despite the inefficiency of, for example, 21 bit left shift on ARM,

I don't understand, why you say that a 21 bit left shift is inefficient on ARM?

Have Fun!

Top

GARTHWILSON

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Tue Sep 08, 2020 3:56 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8510
Location: Southern California

I can hardly imagine you wrote all that just for our forum! How fast do you type, anyway?? That's the longest thing I've read in months, or more. I was able to follow much (definitely not all) of it, and have a few comments:

About binary compatibility: Is there really any need for that? It would probably be enough if source code didn't need heavy modification. You know your intended purpose though. You wrote, "There is suggestion that a 32-bit extension of 6502 does not have to be binary-compatible." I would go with that.
I'm not sure I totally followed the part about atomic operations. Pin 5 on the W65C02S is ML\ ("memory-lock-not"), or MLB ("memory-lock-bar," called such because originally the data sheets were done on a system that had no way to do the overbar, which I can't even put here in the forum post, but it's <span style="text-decoration: overline;">ML</span> in html), to ensure integrity in R-M-W instructions in multi-processor systems. Also, the interrupt sequence, even for IRQ (not just NMI) includes pushing the status and setting the I flag; so SEI...CLI is not needed there, and in rare situations, might even cause trouble if you have a CLI before the RTI.
Regarding prefix instructions versus mode bits: I'll comment again that in my 65816 Forth, I leave the accumulator in 16-bit mode almost full time, and the index registers in 8-bit mode almost full time. The M and X bits seldom get touched. The beginning of every Forth primitive expects a 16-bit accumulator and 8-bit index registers, and finishes the same way. Obviously it has to change for a few things, like the MVP and MVN instructions.
Occasionally someone else will suggest a kernel or OS here where all I/O is absolutely required to go through OS calls. That may be ok for office-type applications, but is absolutely unacceptable for my uses on the workbench where I need very immediate access to the I/O.
Apparently speed is not much of a concern though, if you're going to emulate it on an AVR. I myself have threatened to emulate my 65Org32 with a PIC microcontroller (just because that's what I'm experienced with). It would be slower than a seven-year itch with 14 years to scratch; but it would let me experiment with the instruction set.
I use the C32 (Cross 32) assembler, originally from Universal Cross Assemblers in Canada, but now sold by Data Sync Engineering, at http://www.datasynceng.com/c32doc.htm . It's also available at http://www.mpeforth.com/cross32.htm . It won't do your 64-bit stuff; but up to 32-bit, it comes with the files to assemble for dozens of different processors, and gives you the information to make up the files for a new processor, including one of your own design. Supposedly it's about a 40-hour project to do that. If you go through with it, you'll have a really nice macro assembler.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

Top

Sheep64

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Thu Sep 17, 2020 11:31 am

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field

Apologies for delay. I discovered 65Org32 and I've been implementing it. See viewtopic.php?f=1&t=1419&p=78332#p78332 for details.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!

Top

Sheep64

Post subject: Re: Multi-Core 64 Bit SIMD On AVR

Posted: Sat Dec 19, 2020 5:13 pm

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field

Apologies for delay. I bet you thought that I'd disappeared permanently. I've been off-line; in part due to minor illness. During my absence, I read the majority of the forum archive. I've barely started reading references but I believe that I have found many connections and opportunities which have been missed by others. My findings may require dozens of minor threads on unrelated topics before I return to this more major topic. This may take more than three months.

In response to GARTHWILSON's most pressing concerns: I believe that it is reasonable for a brief ISR [Interrupt Service Routine] written in a high level language to complete within 100 clock cycles. On typical configurations of MC68000, two uses of MOVEM may blow more than half of this budget. It is not possible for software running in a container to access I/O at this speed but there is no obligation for software to run in this configuration. The preferred arrangement is a text console running in a 16 bit address-space managing a real-time process running in a 32 bit address-space. Bare metal is an equally valid arrangement. The criteria for success is 48kHz audio recording or playback, without any FIFO trickery, using an FPGA clocked at 5MHz or less. (10MHz or less if the implementation is only clocked on one edge.)

In response to White Flame: I had no idea that your work was required. I was initially skeptical about register windowing. However, you present a coherent argument and, objectively, any doubter can count the clock cycles. Keep going. I will ensure that my work is technically compatible with your work (although GPL3 without an exception clause may be a little restrictive for a prospective application author). Regardless, I have an outline of a paged decompressor suitable for bytecode and a bytecode interpreter. I am willing to license implementation on your terms if it aids your work.

General response: The forum is heavily skewed towards hardware construction and vintage hardware. I am a programmer with a personal and professional interest in low power computing (agriculture, automotive) and high power computing (rendering, scientific). I don't enjoy circuit board layout or summing propagation delays. I would prefer to work on application software. Also, I am not working on a toy project. You might have noticed that computer security is deteriorating. I hoped that 6502 could be almost a drop-in replacement for the current disasters. My most pressing concern was how to extend the address-space while retaining the maximum security, expertise and implementation compatibility.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!

Top

Page 1 of 1

[ 11 posts ]

Board index » 6502.org Users Forum » Emulation and Simulation

All times are UTC

Who is online

Users browsing this forum: No registered users and 1 guest

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum