6502.org • View topic - asynchronous 6502


All times are UTC

asynchronous 6502


BigEd

Post subject: asynchronous 6502

Posted: Thu Dec 24, 2009 2:30 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

German-speakers or the intrepid might be interested in this paper on asynchronous implementation techniques for FPGA, which they illustrate with a 6502 re-implementation. To do that, they analyse the 6502 as-is - see page 73 or so.

The block diagram on page 127 looks clearer to me than the usual one:

[Attachment: sync6502block.png]

(The MuxDataIn_new is both multiplexor and incrementer for PC.)

Edit: replaced the diagram. Also credits: M. Balke, T. Dettbarn, R. Homann, S. Jaenicke, T. Köhler, H. Mersch, and H. Weiss: Eine asynchrone Implementierung eines Microprozessors auf einem FPGA
Technical Report TR2001-04, Technische Fakultät, Universität Bielefeld (2001). ISSN 0964-7831

Last edited by BigEd on Fri Dec 13, 2013 4:05 pm, edited 2 times in total.

Top

BigDumbDinosaur

Post subject: asynchronous 6502

Posted: Thu Dec 24, 2009 7:06 pm

Joined: Thu May 28, 2009 9:46 pm
Posts: 8144
Location: Midwestern USA

It looks interesting, although my knowledge of German is almost as extensive as my understanding of how to successfully perform open-heart surgery on a live patient.

Perhaps a translation into the King's English will eventually be posted, since I'm sure interest in this would extend well beyond Germany's borders.

It continues to amaze me just how much interest there is today in a piece of silicon that was devised some 35 years ago. I've never lost interest in the little processor that could, even after some 30 years of writing machine code for it. We may all gush about the performance of the latest dual- and quad-core x86-64 whatevers (my dual Opteron-powered Linux server is smokin' fast), but the modest eight bit 6502 just keeps on cranking.

_________________
x86? We ain't got no x86. We don't NEED no stinking x86!

Top

kc5tja

Post subject:

Posted: Thu Dec 24, 2009 10:17 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

I think asynchronous CPU designs will be the wave of the future, once people really wake up about it. By definition, they always run as fast as possible, consume the least amount of power doing so (no need for clock distribution trees, DLLs, and other power-hogging circuits; plus, functional units left unused or unaccessed naturally don't consume power, as no transistors are switching states!), and actually permit greater logic densities because of the space freed up from not having to route clocks all over hell and creation.

The problem with asynchronous designs, however, is their speed varies somewhat with temperature. I can say with utmost certainty that the Intellasys multi-core Forth chips run approximately 450MHz to 700MHz equivalent performance ranges, per core. But, on any particular run, I cannot tell you the precise equivalent clocking speed. (Since we're dealing with a dual-stack CPU architecture, one clock per instruction is assumed.) It also, obviously, depends on the precise instruction mix too.

Nonetheless, there are techniques one can use to mitigate this. First and foremost, like high bandwidth networks, you can rely extensively on the fact that "speed is cheap," and as long as your CPU processes data faster than the external circuitry can keep up, it doesn't matter what your CPU's precise instruction rate actually is. It all looks the same from the external, perhaps even synchronous, peripheral's point of view. I implemented a VGA display using one of the Intellasys chips, relying on the monitor's built-in PLL to maintain frame synchronization, and it worked extremely well. Changes in temperature between my home and the SVFIG meeting room accounted for only a handful of percentage points of discrepancy in ideal frequency. Only the first two or three scanlines showed significant distortion. Other distortions required you to know exactly what to look for, and at close viewing distances.

Indeed, consider the more common case that affects everyone on the planet today: that in between all the pipeline stages in contemporary CPUs, the logic is most assuredly purely combinatorial -- in other words, asynchronous.

And speaking of pipelines, another huge challenge for asynchronous logic these days is the ultra-deep pipelines we're all used to. You cannot have deep pipes with asynchronous logic, because there's really no benefit in having them. Assuming you're using Muller C-elements for inter-stage synchrony, you're almost certainly going to end up with 50% utilization, since a circuit needs to relax before taking on another chunk of data (BTW, a similar phenomenon explains why an over-stimulated neuron will develop a tolerance to whatever stimulus you provide it). With utilization that low, you might as well halve the number of pipeline stages and introduce more combinatorial logic -- it'll be faster anyway (less time spent waiting for propagation delays on the C-elements).
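For readers who have not met the Muller C-element mentioned above, here is a minimal behavioural sketch in Python. It assumes the usual textbook definition (the output changes only when both inputs agree, otherwise it holds its previous value); the class name and the input sequence are illustrative and are not taken from the paper or from any Intellasys design.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial: int = 0) -> None:
        # The C-element is state-holding, so it needs a starting output.
        self.out = initial

    def step(self, a: int, b: int) -> int:
        # Output follows the inputs only when they agree;
        # when they disagree, the previous output is held.
        if a == b:
            self.out = a
        return self.out


c = CElement()
# Walk through one rise and one fall of the output.
inputs = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
trace = [c.step(a, b) for a, b in inputs]
print(trace)  # [0, 0, 1, 1, 0]
```

The held states in the middle of the trace are the "relaxation" the post refers to: the output cannot rise again until both inputs have returned to zero.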

Shallow pipes, however, will definitely come into their own again -- the old 4-stage pipeline (fetch, decode, execute, write-back) makes a lot of sense in an asynchronous design, assuming you don't adopt a MISC architecture. Each stage really does conform to independent, logically-isolated sub-functions of a CPU, which is where asynchronous logic really shines. (Note: the 6502 supports a crude 2-stage pipeline internally; however, its pipeline doesn't work at the instruction level, but rather at the individual cycle level; that is, for any given cycle N, it has already decoded what to do for N+1. This is why 2 byte, 3 byte, and 4 byte instructions often take 2, 3, and 4 cycles to complete, while 1 byte instructions still, unfortunately, take 2.)

I would love to see what performance figures an asynchronous 6502 gets. It won't be anywhere near the performance of the Intellasys chips though (or, now-a-days, Green Arrays, as Chuck Moore has left Intellasys over various business-related issues).

Top

fachat

Post subject:

Posted: Tue Jan 12, 2010 8:42 pm

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 990
Location: near Heidelberg, Germany

Thank you for an amazing link (I can read German after all :-)

Their block diagram was reverse engineered, they claim they did not have access to an original block diagram - neither to an original CPU decoder table.

They also say that the asynchronous version of the 6502 used too many CLBs in the FPGA, so they could not "add translation drivers from conventional to flank-triggered dual-rail logic, and therefore no benchmark between synchronous and asynchronous variant is available" (my own crude translation).

Also they write that "For reasons of [FPGA] space the much more interesting Z80 was out of the question". ... "With [the 6502's] 56 opcodes and a maximum opcode length of three bytes its implementation just barely fits on the FPGA used here".

"The translation of the 6502 into an asynchronous version first resulted in much too large processors, therefore optimization concentrated mainly on space and not on performance."

They say they have combined the small scale of the "level-controlled dual-rail" and the speed of the "flank-triggered dual-rail". Whatever that means in asynchronous logic.

"The asynchronous 6502 used 3091 CLBs, i.e. only about three times the number of CLBs of the synchronous version." ... "It has to be said that the optimization settings were different in both cases: the synchronous 6502 was compiled with much less optimization, to keep compile time low. The asynchronous version was compiled with maximum optimization to keep it small".

Further they compare different 6502 implementations:

"The synchronous 6502, together with the PCI wrapper and PCI interface used 1107 CLBs. Another compiled 6502, implemented with microcode and barely optimized, from the freeip project, used 1118 CLBs [edit, the following was missing: even without the PCI wrapper]. This comparison shows that with the same function, but a more elaborate VHDL-description, 5% can be saved."

They conclude: "Unfortunately a speed comparison could not be performed due to the given reasons. As first simulator runs show however, some execution steps run much further than the maximum to be used clock frequency. But quickly the main problem of asynchronous CPUs showed: there are only much too short execution steps and too many accesses to external components (mostly RAM). Even though with the "DEP-Format" [obviously their technique] a technical way to increase RAM access speed (at the expense of memory efficiency [Speicherausnutzung]), a CPU (the size of a 6502) in asynchronous logic in a standard environment, can barely achieve speed increase. With the current processors with considerably more complex internal logic and longer computations between memory accesses the situation is surely different."

André

Top

ElEctric_EyE

Post subject:

Posted: Tue Jan 12, 2010 8:52 pm

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA

What IC did they use for the CPLD/FPGA?

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502

Top

BigEd

Post subject:

Posted: Tue Jan 12, 2010 9:25 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

Thanks for the extra translated snippets! It's interesting that they were able to construct a smaller 6502 than the free-ip one. (Anyway, FPGAs are bigger now)

(I hadn't realised at the time, but this PSALM project was already catalogued in http://6502.org/homebuilt/)

(I suspect flank-triggered must be edge-triggered)

The point about memory accesses dominating the speed is interesting. Of course, there will always be a bottleneck somewhere. The Amulet (Asynchronous ARM) project seems to anticipate the same issue.

Cheers
Ed

Top

BigEd

Post subject:

Posted: Tue Jan 12, 2010 9:29 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

ElEctric_EyE wrote:

What IC did they use for the CPLD/FPGA?

Google translate says:

Quote:

On the hardware is based on our project seminar Psalm PCI cards Virtual Computer Corporation, Reseda, which are equipped with FPGAs of Xilinx, Inc..

Were available at two different cards, of which the capacity of the smaller manufactures a XC4062XLT, the larger with a XC4085XLA FPGA
is fitted.

Other constituents of the cards have memory in the form of 4MB SRAM, 2MB Configuration Flash and 512KB cache RAM, two expansion connectors and a the programmable clock generator that supports frequencies from 360kHz to 100MHz

Top

kc5tja

Post subject:

Posted: Tue Jan 12, 2010 9:34 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

fachat wrote:

They also say that the asynchronous version of the 6502 used too many CLBs in the FPGA, so they could not "add translation drivers from conventional to flank-triggered dual-rail logic, and therefore no benchmark between synchronous and asynchronous variant is available" (my own crude translation).

Hogwash. Write a program. Load it onto a synchronous 6502 design. Observe the rate at which _SYNC is asserted per unit of wall-clock time. That is your CPU's instruction rate.

Do the same with the asynchronous implementation (I'm sure that, in the absence of _SYNC, they had some alternative signal they could have monitored), for the same amount of wall-time.

Divide one by the other to get a ratio of performance differences.
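The measurement described above boils down to two divisions. A rough sketch in Python, assuming you have some way to count SYNC pulses (or an equivalent per-instruction signal) over a fixed wall-clock window; the function names and the pulse counts are made up for illustration and are not measurements from the paper.

```python
def instruction_rate(sync_pulses: int, seconds: float) -> float:
    """Instructions per second, taking one SYNC pulse per opcode fetch."""
    if seconds <= 0:
        raise ValueError("measurement window must be positive")
    return sync_pulses / seconds


def speedup(async_rate: float, sync_rate: float) -> float:
    """Ratio of the two instruction rates (>1 means the async core is faster)."""
    return async_rate / sync_rate


# Hypothetical counts for the same program over the same 1-second window.
sync_rate = instruction_rate(2_000_000, 1.0)
async_rate = instruction_rate(3_000_000, 1.0)
print(speedup(async_rate, sync_rate))  # 1.5
```

Both runs need the same program and the same window, otherwise the instruction mix skews the ratio.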

HOWEVER, . . .

Quote:

Also they write that "For reasons of [FPGA] space the much more interesting Z80 was out of the question".

Who would want to make an asynchronous Z80? If you're going to go through that kind of effort, you might as well go with an 8086, which is not too far from the Z80 capability-wise, and I claim is a more powerful processor anyway. (OTOH, the Z80 has the advantage of EX AF,AF' and EX R,R' instructions for super-fast interrupt response.)

Quote:

They say they have combined the small scale of the "level-controlled dual-rail" and the speed of the "flank-triggered dual-rail". Whatever that means in asynchronous logic.

My interpretation of this is that they combined the small size permitted by a purely combinatorial circuit (which, by definition, is level-triggered; I haven't seen an edge-triggered NAND gate recently) with the high speeds permitted by dual (rising and falling) edge-triggered circuits. This is possible because you no longer have to worry about clock skew and distribution trees throughout the whole circuit.

Quote:

"But quickly the main problem of asynchronous CPUs showed: there are only much too short execution steps and too many accesses to external components (mostly RAM). Even though with the "DEP-Format" [obviously their technique] a technical way to increase RAM access speed (at the expense of memory efficiency [Speicherausnutzung]), a CPU (the size of a 6502) in asynchronous logic in a standard environment, can barely achieve speed increase. With the current processors with considerably more complex internal logic and longer computations between memory accesses the situation is surely different."

The problem I have with this conclusion is that it's wrong, on several levels.

(1) An asynchronous CPU, by definition, cannot run in "a standard environment," for the latter's performance is always dictated by the clock. You need an asynchronous means of accessing RAM, for starters (a quadrature interlocking protocol works great for this, and I claim is much simpler than the 68000-style AS/DTACK approach). Then, knowing that long lines to outside resources will incur a major performance hit, you MUST cache. In fact, even with a synchronous design, caching starts to look awfully appealing at CPU clock speeds in excess of 8MHz (which requires memory capable of keeping up with a 16MHz clock, since you're hitting it only during half a clock period). This is why most high-speed 65xx designs do NOT expose their buses through expansion ports.

(2) The conclusion exposes the author's complete misunderstanding of the purpose for using asynchronous logic. Even if you use an asynchronous CPU in an otherwise synchronous environment, you get the benefit of a chip which draws significantly less power (if implemented correctly; I'm not sure this would work in an FPGA environment), is much quieter in terms of RF hash, and which produces substantially less heat. These all can be extremely valuable qualities. For example, on an Intellasys SEAforth 24A chip that I have, I had all 24 cores running full blast at an estimated 650 MIPS each, and that chip only became two degrees warmer. ONLY two degrees! And, if my ham rig is any indication, it had no observable hash emissions; everything it put out was well below nature's own noise floor. Contrast this against my desktop PC, whose emissions are so strong I can hear it in my headphones when the sound card is silent.

(3) The conclusion leads the reader to directly believe that asynchronous logic is a worthless pursuit. As you've translated it, and assuming I haven't had prior experience with asynchronous designs, I would never want to consider asynchronous logic ever again after reading it. Conclusions should recite the constraints of their conclusions, which is clearly not done here, with the exception of disclosing how they're interfacing to RAM, both of which appear to be synchronous based on context provided in your translation.

My conclusions could well be wrong, though, but since I don't read German, there's no way I can really respond from reading the primary source.
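The half-cycle arithmetic in point (1) can be checked with a few lines of Python. This is only a sketch under the post's own assumption that memory is accessed during half of each clock period; it ignores address decoding, setup and hold times, and the function names are illustrative.

```python
def access_window_ns(cpu_mhz: float, fraction: float = 0.5) -> float:
    """Time available for a memory access, in nanoseconds."""
    period_ns = 1000.0 / cpu_mhz      # one full clock period
    return period_ns * fraction       # the part of it the memory gets


def required_memory_mhz(cpu_mhz: float, fraction: float = 0.5) -> float:
    """Clock rate whose full period equals the access window."""
    return cpu_mhz / fraction


print(access_window_ns(8.0))      # 62.5
print(required_memory_mhz(8.0))   # 16.0
```

So at 8MHz the memory has about 62.5 ns to respond, which is the full period of a 16MHz clock.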

Top

BigEd

Post subject:

Posted: Tue Jan 12, 2010 10:28 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

Seems a bit hostile!

I would have thought the main point of the exercise is finding a way to apply asynchronous techniques in an FPGA, demonstrate it, and find some of the limitations.

I think it is a diploma-level thesis, which was successful in showing originality and competence. It's almost inevitable that time will run out before every possible lesson can be drawn out and documented.

We, on the other hand, can run our projects for years!

Top

bogax

Post subject:

Posted: Tue Jan 12, 2010 11:27 pm

Joined: Tue Nov 18, 2003 8:41 pm
Posts: 250

First a caveat: Take this with a grain of salt, 'cause I was never
conversant on the subject and it's been years and years.

kc5tja wrote:

Quote:

They say they have combined the small scale of the "level-controlled dual-rail" and the speed of the "flank-triggered dual-rail". Whatever that means in asynchronous logic.

Well, sort of

Wikipedia says a dual rail logic system has separate lines for
1 and 0 which is sort of true as far as it goes.

Way back in the before times, when Large Scale Integration meant
something like having a whole flip flop on a single board (say,
late '50s, early '60s) they used dual rail logic systems.

Imagine an asynchronous shift register built from set/reset latches.

Now imagine your whole system, all the combinatorial logic, worked like that.

But it also added redundancy sort of like using a differential line
or a parity bit.
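To make the "separate lines for 1 and 0" idea concrete, here is a small Python sketch of one common dual-rail convention. The rail assignment (a true rail and a false rail, with both low meaning "no data yet") is an assumption for illustration; the paper's level-controlled and flank-triggered variants may well differ.

```python
NULL = (0, 0)  # spacer: neither rail asserted, no data present


def encode(bit: int) -> tuple[int, int]:
    """Encode one bit as (true_rail, false_rail)."""
    return (1, 0) if bit else (0, 1)


def decode(rails: tuple[int, int]):
    """Return 1, 0, or None for the spacer; both rails high is an error."""
    if rails == (1, 0):
        return 1
    if rails == (0, 1):
        return 0
    if rails == NULL:
        return None
    raise ValueError("both rails high: illegal code word")


def word_complete(word: list[tuple[int, int]]) -> bool:
    """Completion detection: every bit position carries valid data."""
    return all(r in ((1, 0), (0, 1)) for r in word)


word = [encode(1), encode(0), NULL]
print(word_complete(word))                          # False
print(word_complete([encode(1), encode(0)]))        # True
```

The redundancy shows up in two places: the receiver can tell when a whole word has arrived without a clock, and the (1, 1) state can only appear through a fault.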

How (or if) that fits into a modern FPGA I don't know (they may be
talking about something else, a later evolution perhaps)

Also there were 2 and 4 clock systems (which I think they still use)
that could work in a similar fashion that basically had separate lines
for the pull ups and the pull downs (so they were never both on at
the same time).

Top

ElEctric_EyE

Post subject:

Posted: Wed Jan 13, 2010 1:41 pm

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA

BigEd wrote:

ElEctric_EyE wrote:

What IC did they use for the CPLD/FPGA?

Google translate says:

Quote:

On the hardware is based on our project seminar Psalm PCI cards Virtual Computer Corporation, Reseda, which are equipped with FPGAs of Xilinx, Inc..

Were available at two different cards, of which the capacity of the smaller manufactures a XC4062XLT, the larger with a XC4085XLA FPGA
is fitted.

Other constituents of the cards have memory in the form of 4MB SRAM, 2MB Configuration Flash and 512KB cache RAM, two expansion connectors and a the programmable clock generator that supports frequencies from 360kHz to 100MHz

I'm curious what the cost of hardware is for developing a 6502 on an FPGA...
The Xilinx XC4085 has been discontinued but, as a reference, it has 85,000 gates and 448 I/O: http://www.xilinx.com/support/documenta ... /ds015.pdf

The Spartan 6 line from Xilinx, specifically the XC6SLX100, has 101,000 gates and 480 I/O: http://www.xilinx.com/publications/prod ... &width=600

Digikey has an XC6SLX45 development kit, only 44,000 gates and 358 I/O, for about $550. The IC itself is $54 NIS. The XC6SLX75, /100, /150 versions don't show up at all.

Mouser doesn't distribute Xilinx... There are other companies like Altera, Xilinx's main competitor according to Wikipedia, but you have to "rent" their software. Lattice has SRAM based FPGAs so it's difficult to compare "gates".

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502

Top

BigEd

Post subject:

Posted: Wed Jan 13, 2010 1:54 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

See a previous discussion - both OHO in Germany and Craignell in the UK have a range of 5V-compatible DIP-mounted FPGA products. I have, but have not yet made much use of, two of OHO's products: a 40-pin one with no onboard RAM chip which is suitable for a 6502 CPU or similar (with hugely more capacity than needed for that, including some 10k bytes of storage) and a 24-pin one with a 512kbyte RAM which is more suitable for a 6502 system.

All these products are rather less than $100. They are all Xilinx, and the tools are zero-cost and cross-platform.

I have some ideas about configuring the 24-pin part as a RAM or ROM - can actually make it 28-pin - and posting I/O through the "memory" to a co-processor inside, which could be 6502 or similar, or something completely different. I very much like the idea of a 60MHz 6502 inside an Oric.

Top

kc5tja

Post subject:

Posted: Wed Jan 13, 2010 4:33 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

Xess sells a Xilinx Spartan 3-based development board, with what appears to be a one million gate FPGA, for $199:

http://www.xess.com/prods/prod035.php

I was thinking about acquiring one of these boards to play with some time ago. 1 million gates is a lot of gates to have fun with.

Top

BitWise

Post subject:

Posted: Wed Jan 13, 2010 4:40 pm

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK

kc5tja wrote:

Yep. It's a nice bit of kit. I bought one a few years back.

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs

Top

OwenS

Post subject:

Posted: Wed Jan 13, 2010 6:01 pm

Joined: Thu Jul 26, 2007 4:46 pm
Posts: 105

kc5tja wrote:

(2) The conclusion exposes the author's complete misunderstanding of the purpose for using asynchronous logic. Even if you use an asynchronous CPU in an otherwise synchronous environment, you get the benefit of a chip which draws significantly less power (if implemented correctly; I'm not sure this would work in an FPGA environment), is much quieter in terms of RF hash, and which produces substantially less heat. These all can be extremely valuable qualities. For example, on an Intellasys SEAforth 24A chip that I have, I had all 24 cores running full blast at an estimated 650 MIPS each, and that chip only became two degrees warmer. ONLY two degrees! And, if my ham rig is any indication, it had no observable hash emissions; everything it put out was well below nature's own noise floor. Contrast this against my desktop PC, whose emissions are so strong I can hear it in my headphones when the sound card is silent.

With desktop class IC processes, an asynchronous processor consumes far more power than a synchronous one. Why? Because leakage current is higher than switching current. Modern processors use FETs with threshold voltages of 200mV, and 1V power supplies. Therefore, power consumption is linear with powered transistors; clock frequency is not a major factor. Asynchronous designs, as we have established, use more transistors.

The biggest gains with things like Intel SpeedStep and AMD Cool 'n Quiet come from the fact that the processor's Vcore is reduced, not the frequency reduction; reducing the frequency just allows that.
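The point about Vcore can be illustrated with the standard first-order model for switching power, P = a * C * V^2 * f. The Python sketch below uses invented values for activity factor, capacitance, voltage and frequency, and it models only the switching term; leakage, which the post argues dominates on desktop processes, is not modelled here.

```python
def dynamic_power(activity: float, capacitance_f: float,
                  vcore: float, freq_hz: float) -> float:
    """First-order switching power in watts: a * C * V^2 * f."""
    return activity * capacitance_f * vcore ** 2 * freq_hz


# Hypothetical baseline: 1.0 V core at 2 GHz.
base = dynamic_power(0.1, 1e-9, 1.0, 2e9)

# Drop the frequency by 20% but keep the voltage.
freq_only = dynamic_power(0.1, 1e-9, 1.0, 1.6e9)

# Drop the frequency by 20% and use that headroom to drop Vcore by 20%.
freq_and_v = dynamic_power(0.1, 1e-9, 0.8, 1.6e9)

print(round(freq_only / base, 3))   # 0.8
print(round(freq_and_v / base, 3))  # 0.512
```

Frequency scales the switching term linearly while voltage scales it quadratically, which is why the lower frequency matters mainly because it permits the lower voltage.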

Oh, and the noise? Firstly, it's indicative of a badly designed sound card (one in need of better supply filtering), and it's mostly coming from the switching regulator which powers the processor. The switching regulators which power the high drain devices in modern computers (CPU, GPU) all take their power from the 12V rails; if you look at them, the ripple is truly quite atrocious, but it doesn't matter, because nothing is powered directly off them anyway.

Top
